Updated on 2020-01-17
Manage location tracking, lifetime, error reporting and basic parsing/lexing with this simple class
This is a topic I've covered before, but the code was getting rusty, and a bit bloated, so I've decided to streamline, rewrite, and produce an accompanying article.
I write a lot of parsers, lexers, and generators for both, and along the way I've developed some simple tools to cover the common pitfalls. The class presented here is absolutely foundational to a lot of my code. I use it or one of its variants in many of my projects, including Parsley. It provides a unified facade over disparate input sources, managing lifetime, location information such as line and column, input capture, and error reporting, and it offers a better cursor than TextReader or IEnumerator<char>.
I'm going to re-cover ground from my previous article on this topic. If you followed that article, the main changes are that I've removed lookahead to eke a little more performance out of the class, and moved some of the helper methods into a separate file. This keeps the core API extremely spartan while the extended functionality lives in a separate included file.
There are several pitfalls to lexing and parsing this class covers, despite the API being small, so we'll cover them in sections.
LexContext provides Create() and CreateFrom() which take an IEnumerator
In addition, LexContext provides a Close() method and is disposable so that you can close the underlying file or network stream, if any.
Regardless of how the instance was created, the interface for accessing it is the same.
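To make the "unified facade" idea concrete, here is a minimal sketch of how such a facade could funnel different sources into one reader and own its lifetime. It is not LexContext's implementation: the `MiniInput` class and all of its members are hypothetical stand-ins, and the real Create()/CreateFrom() signatures may differ.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for illustration only; not the real LexContext.
sealed class MiniInput : IDisposable
{
    readonly TextReader _reader;
    MiniInput(TextReader reader) { _reader = reader; }

    // Wrap an in-memory string.
    public static MiniInput Create(string text) => new MiniInput(new StringReader(text));

    // Wrap a file on disk; the file is closed when the instance is disposed.
    public static MiniInput CreateFrom(string path) => new MiniInput(new StreamReader(path));

    // Returns the next character, or -1 at end of input.
    public int Read() => _reader.Read();

    public void Close() => _reader.Dispose();
    public void Dispose() => Close();
}

static class UnifiedInputDemo
{
    // Consumer code is identical no matter where the text came from.
    public static int CountChars(MiniInput input)
    {
        var count = 0;
        while (-1 != input.Read()) ++count;
        return count;
    }

    static void Main()
    {
        using (var input = MiniInput.Create("{\"a\":1}"))
            Console.WriteLine(CountChars(input)); // prints 7
    }
}
```

The point of the design is that `CountChars()` never learns whether it is reading a string or a file, and the `using` block guarantees the underlying source is released.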
When writing lexers or lexerless parsers, you often need to keep the current character under the cursor around, and pass it to the various parse or lex functions you've created. I'll demonstrate later, but basically, you'll find yourself using Peek() on TextReader if you don't, and that's Bad(TM) as it prevents your code from working over network streams or other non-seekable sources, like Console input.
IEnumerator
LexContext, like TextReader, will report the current cursor value even if the cursor is not over a character. Like IEnumerator
Advance() works like TextReader.Read() in that it advances the cursor and returns the character that was read. After Advance(), that character can be accessed again through the Current property. You can use whichever mechanism for retrieving the current character that suits your code in the moment.
Using these makes the above pitfalls vanish, which makes the code that drives a lex or parse almost trivial while preserving the ability to examine non-seekable streams.
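The cursor behavior described above can be sketched in a few lines. This is an illustrative stand-in, not the real class: `MiniCursor` is a hypothetical name, and the -2 "before input" sentinel is my assumption (only -1 for end of input appears in the demo code later in this article).

```csharp
using System;
using System.IO;

// Hypothetical stand-in that mimics the Current/Advance() behavior.
sealed class MiniCursor
{
    public const int BeforeInput = -2; // assumed sentinel: not advanced yet
    public const int EndOfInput = -1;  // cursor is past the last character

    readonly TextReader _reader;
    public int Current { get; private set; } = BeforeInput;

    public MiniCursor(TextReader reader) { _reader = reader; }

    // Moves to the next character and returns it, like TextReader.Read().
    public int Advance()
    {
        if (EndOfInput != Current)
            Current = _reader.Read(); // Read() returns -1 at the end
        return Current;
    }
}

static class CursorDemo
{
    // Lexes a run of digits. It leaves the cursor on the first non-digit
    // without ever calling Peek().
    public static int ParseInt(MiniCursor c)
    {
        var value = 0;
        while (c.Current >= '0' && c.Current <= '9')
        {
            value = value * 10 + (c.Current - '0');
            c.Advance();
        }
        return value;
    }

    static void Main()
    {
        var c = new MiniCursor(new StringReader("42+7"));
        c.Advance();                        // prime the cursor
        Console.WriteLine(ParseInt(c));     // prints 42
        Console.WriteLine((char)c.Current); // prints +
    }
}
```

Because the character that ended the number is still available through `Current`, the next lex function can pick up from there, and nothing in the code requires a seekable source.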
LexContext tracks position, line and column info, and the current file or URL where the document resides, if any. This is most often used for error reporting, but it can be used for anything that needs it. Tab width is 4 unless otherwise specified using the TabWidth property. Setting this to the input device's tab width allows column information to be tracked accurately. For the console, as well as most underlying input sources, the default should be appropriate. The Line and Column properties are one-based, while the Position property is zero-based. FileOrUrl holds the source location of the current document.
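As a rough sketch of what this bookkeeping can look like, the following tracker updates line, column, and position once per consumed character. The `MiniLocation` class is hypothetical, and the update rules (especially adding the full tab width on a tab rather than snapping to a tab stop) are assumptions, not a statement of what LexContext does internally.

```csharp
using System;

// Hypothetical tracker; the update rules are assumed for illustration.
sealed class MiniLocation
{
    public int TabWidth { get; set; } = 4;
    public int Line { get; private set; } = 1;      // one-based
    public int Column { get; private set; } = 1;    // one-based
    public long Position { get; private set; } = 0; // zero-based

    // Call once for every character consumed. Afterward the properties
    // describe the location of the next unread character.
    public void Track(char ch)
    {
        ++Position;
        switch (ch)
        {
            case '\n': ++Line; Column = 1; break;
            case '\r': Column = 1; break;
            case '\t': Column += TabWidth; break;
            default: ++Column; break;
        }
    }
}

static class LocationDemo
{
    public static MiniLocation Scan(string text)
    {
        var loc = new MiniLocation();
        foreach (var ch in text) loc.Track(ch);
        return loc;
    }

    static void Main()
    {
        var loc = Scan("ab\n\tc");
        Console.WriteLine($"{loc.Line},{loc.Column},{loc.Position}"); // prints 2,6,5
    }
}
```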
ExpectingException is the exception that is thrown when the lexer/parser encounters an error. It reports error location information, a message, and an optional collection of expected characters or symbols. You can throw it manually, but typically it will be raised as a result of calling Expecting().
While parsing, you call Expecting() to indicate that you expect Current to be one of a series of characters. If Current is not one of those characters, an ExpectingException is automatically filled with the current context information and then thrown.
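The pattern can be sketched as follows. This is a simplified stand-in under stated assumptions: `MiniExpectingException`, the free-standing `Expecting()` helper, and the message format are all hypothetical, and the real method reads the location from the context instead of taking it as parameters.

```csharp
using System;

// Hypothetical exception type carrying the data the article describes.
sealed class MiniExpectingException : Exception
{
    public int Line { get; }
    public int Column { get; }
    public string[] Expecting { get; }

    public MiniExpectingException(string message, int line, int column, string[] expecting)
        : base(message) { Line = line; Column = column; Expecting = expecting; }
}

static class ExpectingDemo
{
    // Throws unless current is one of the expected characters.
    // With no arguments, anything except end of input (-1) is accepted.
    public static void Expecting(int current, int line, int column, params int[] expecting)
    {
        if (0 == expecting.Length)
        {
            if (-1 != current) return;
            throw new MiniExpectingException(
                $"Unexpected end of input at line {line}, column {column}",
                line, column, new string[0]);
        }
        if (Array.IndexOf(expecting, current) >= 0) return;
        var names = Array.ConvertAll(expecting,
            e => -1 == e ? "end of input" : "'" + (char)e + "'");
        var found = -1 == current ? "end of input" : "'" + (char)current + "'";
        throw new MiniExpectingException(
            $"Expecting {string.Join(" or ", names)} but found {found} at line {line}, column {column}",
            line, column, names);
    }

    static void Main()
    {
        try { Expecting('x', 3, 14, '}', ','); }
        catch (MiniExpectingException ex) { Console.WriteLine(ex.Message); }
        // prints: Expecting '}' or ',' but found 'x' at line 3, column 14
    }
}
```

Building the message at the point of failure is what lets a single call both validate the input and produce a useful, located error.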
As you parse or lex, you'll typically need to capture the input you're examining to be retrieved later during the parse/lex, as part of a subparse, or simply to report from your lex or parse functions. LexContext provides a CaptureBuffer to help do just that.
CaptureBuffer itself is a StringBuilder, while LexContext provides Capture() to capture the current character under the cursor, if any, GetCapture() to retrieve all or part of the captured buffer, and ClearCapture() to clear the capture buffer.
If you need to execute a subparse, simply capture what you need to subparse into the capture buffer, and then create another LexContext over the captured string! You can then feed that LexContext to your subparse functions.
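Here is a small sketch of capturing followed by a subparse. `MiniCapture` is a hypothetical stand-in that borrows the member names mentioned above (Capture(), GetCapture(), ClearCapture(), CaptureBuffer); the real signatures may differ, and the `ReadUntil()` helper and the input format are made up for the example.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in with just enough surface to show capturing.
sealed class MiniCapture
{
    readonly TextReader _reader;
    public int Current { get; private set; }
    public StringBuilder CaptureBuffer { get; } = new StringBuilder();

    public MiniCapture(string text)
    {
        _reader = new StringReader(text);
        Current = _reader.Read(); // start on the first character
    }

    public int Advance() => Current = _reader.Read();

    // Appends the character under the cursor, if there is one.
    public void Capture() { if (-1 != Current) CaptureBuffer.Append((char)Current); }
    public string GetCapture() => CaptureBuffer.ToString();
    public void ClearCapture() => CaptureBuffer.Clear();
}

static class CaptureDemo
{
    // Captures everything up to (not including) the stop character.
    public static string ReadUntil(MiniCapture c, char stop)
    {
        c.ClearCapture();
        while (-1 != c.Current && stop != c.Current)
        {
            c.Capture();
            c.Advance();
        }
        return c.GetCapture();
    }

    static void Main()
    {
        var outer = new MiniCapture("key=1,2,3;rest");
        ReadUntil(outer, '=');            // captures "key"
        outer.Advance();                  // skip '='
        var list = ReadUntil(outer, ';'); // captures "1,2,3"

        // Subparse: a second cursor over the captured text.
        var inner = new MiniCapture(list);
        while (-1 != inner.Current)
        {
            Console.WriteLine(ReadUntil(inner, ',')); // prints 1, then 2, then 3
            inner.Advance();                          // skip ','
        }
    }
}
```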
I've provided a simple JSON minifier by way of example. It is also about half of a JSON parser, in that it loads the parsed text into dictionaries and lists depending on whether they are JSON objects or JSON arrays. What it doesn't do is normalize scalar values like strings. Strings are returned with their quotes and embedded escapes; numbers, bools and nulls are returned as strings.
static void Main(string[] args)
{
    // minifies JSON. Does so by parsing into an intermediary graph
    // this step wasn't required, but makes it easier to adapt
    // the code to a real world JSON parser

    // holds our json data
    IDictionary<string, object> json = null;
    // parse our file
    using (var pc = LexContext.CreateFrom(@"..\..\Burn Notice.2919.tv.json"))
        json = _ParseJsonObject(pc);
    // write our json data out
    _WriteJsonTo(json, Console.Out);
}
static object _ParseJson(LexContext pc)
{
    // parses a JSON object, array, or value
    pc.TrySkipWhiteSpace();
    switch (pc.Current)
    {
        case '{':
            return _ParseJsonObject(pc);
        case '[':
            return _ParseJsonArray(pc);
        default:
            return _ParseJsonValue(pc);
    }
}
static IDictionary<string, object> _ParseJsonObject(LexContext pc)
{
    // a JSON {} object - our objects are dictionaries
    var result = new Dictionary<string, object>();
    pc.TrySkipWhiteSpace();
    pc.Expecting('{');
    pc.Advance();
    // skip whitespace here so an empty object written as "{ }"
    // doesn't enter the loop and fail the quote check below
    pc.TrySkipWhiteSpace();
    pc.Expecting(); // expecting anything other than end of input
    while ('}' != pc.Current && -1 != pc.Current) // loop until } or end
    {
        pc.TrySkipWhiteSpace();
        // _ParseJsonValue parses any scalar value, but we only want
        // a string so we check here that there's a quote mark to
        // ensure the field will be a string.
        pc.Expecting('"');
        var fn = _ParseJsonValue(pc);
        pc.TrySkipWhiteSpace();
        pc.Expecting(':');
        pc.Advance();
        // add the next value to the dictionary
        result.Add(fn, _ParseJson(pc));
        pc.TrySkipWhiteSpace();
        pc.Expecting('}', ',');
        // skip commas
        if (',' == pc.Current) pc.Advance();
    }
    // make sure we're positioned on the end
    pc.Expecting('}');
    // ... and read past it
    pc.Advance();
    return result;
}
static IList<object> _ParseJsonArray(LexContext pc)
{
    // a JSON [] array - our arrays are lists
    var result = new List<object>();
    pc.TrySkipWhiteSpace();
    pc.Expecting('[');
    pc.Advance();
    // skip whitespace here so an empty array written as "[ ]"
    // doesn't pick up a spurious empty item
    pc.TrySkipWhiteSpace();
    pc.Expecting(); // expect anything but end of input
    // loop until end of array or input
    while (-1 != pc.Current && ']' != pc.Current)
    {
        pc.TrySkipWhiteSpace();
        // add the next item
        result.Add(_ParseJson(pc));
        pc.TrySkipWhiteSpace();
        pc.Expecting(']', ',');
        // skip the comma
        if (',' == pc.Current) pc.Advance();
    }
    // ensure we're on the final position
    pc.Expecting(']');
    // .. and read past it
    pc.Advance();
    return result;
}
static string _ParseJsonValue(LexContext pc)
{
    // parses a scalar JSON value, represented as a string
    // strings are returned quotes and all, with escapes
    // embedded
    pc.TrySkipWhiteSpace();
    pc.Expecting(); // expect anything but end of input
    pc.ClearCapture();
    if ('\"' == pc.Current)
    {
        pc.Capture();
        pc.Advance();
        // reads until it finds a quote
        // using \ as an escape character
        // and consuming the final quote
        // at the end
        pc.TryReadUntil('\"', '\\', true);
        // return what we read
        return pc.GetCapture();
    }
    pc.TryReadUntil(false, ',', '}', ']', ' ', '\t', '\r', '\n', '\v', '\f');
    return pc.GetCapture();
}
For starters, you can see API functions like TrySkipWhiteSpace(). Those live in LexContext.BaseExtensions.cs. All they do is drive the LexContext to fulfill the requested operation, like skipping over whitespace. TrySkipUntil() and TryReadUntil() are especially helpful because they are escape-aware. Above, we use TryReadUntil() to capture a string with \ as the escape character.
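To show what "escape-aware" means in practice, here is a minimal sketch of an escape-aware read-until loop. It is not the real TryReadUntil(): the helper below is hypothetical, works directly on a TextReader and a StringBuilder, and always consumes the terminator, using its last parameter only to decide whether the terminator is captured.

```csharp
using System;
using System.IO;
using System.Text;

static class ReadUntilDemo
{
    // Hypothetical helper. Reads from the reader into capture until an
    // unescaped terminator is found. Returns false if the input ends first.
    public static bool TryReadUntil(TextReader reader, StringBuilder capture,
        char terminator, char escape, bool captureTerminator)
    {
        int ch;
        while (-1 != (ch = reader.Read()))
        {
            if (escape == ch)
            {
                // keep the escape and whatever follows it, unexamined,
                // so an escaped terminator does not end the read
                capture.Append((char)ch);
                ch = reader.Read();
                if (-1 == ch) return false;
                capture.Append((char)ch);
                continue;
            }
            if (terminator == ch)
            {
                if (captureTerminator) capture.Append((char)ch);
                return true;
            }
            capture.Append((char)ch);
        }
        return false;
    }

    static void Main()
    {
        // The reader starts just after the opening quote of: "say \"hi\"" tail
        var reader = new StringReader("say \\\"hi\\\"\" tail");
        var capture = new StringBuilder("\""); // the opening quote, already captured
        Console.WriteLine(TryReadUntil(reader, capture, '"', '\\', true)); // prints True
        Console.WriteLine(capture); // prints "say \"hi\""
    }
}
```

Note that the sketch does not validate what follows the escape character, which matches the loose behavior of the demo parser.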
After we parse the JSON into the appropriate objects, we simply write them to the console, as shown here:
static void _WriteJsonTo(object json, TextWriter writer)
{
    var d = json as IDictionary<string, object>;
    if (null != d)
        _WriteJsonObjectTo(d, writer);
    else
    {
        var l = json as IList<object>;
        if (null != l)
            _WriteJsonArrayTo(l, writer);
        else
            writer.Write(json);
    }
}
static void _WriteJsonObjectTo(IDictionary<string, object> json, TextWriter writer)
{
    var delim = "{";
    foreach (var field in json)
    {
        writer.Write(delim);
        _WriteJsonTo(field.Key, writer);
        writer.Write(":");
        _WriteJsonTo(field.Value, writer);
        delim = ",";
    }
    if ("{" == delim)
        writer.Write(delim);
    writer.Write("}");
}
static void _WriteJsonArrayTo(IList<object> json, TextWriter writer)
{
    var delim = "[";
    foreach (var item in json)
    {
        writer.Write(delim);
        _WriteJsonTo(item, writer);
        delim = ",";
    }
    if ("[" == delim)
        writer.Write(delim);
    writer.Write("]");
}
Note that this is suitable for minification, but it is not a standards compliant JSON parser. In order to keep things simple for the demo, it's a bit "looser" than the JSON spec in that it allows more than the JSON spec does. Namely, it doesn't enforce valid string escapes, accepts more kinds of whitespace, and doesn't enforce termination of a string on a newline. Despite that, as long as the JSON is valid to begin with, this will parse it.
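If you wanted to tighten the demo toward the spec, two of the checks mentioned above are easy to sketch. The helpers below are hypothetical and are not part of LexContext or the demo; they follow the JSON grammar in RFC 8259, which allows only space, tab, line feed, and carriage return as whitespace, and only the escapes `\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`, and `\u` followed by four hex digits.

```csharp
using System;

static class StrictJsonDemo
{
    // JSON permits exactly four whitespace characters.
    public static bool IsJsonWhiteSpace(int ch)
        => ' ' == ch || '\t' == ch || '\n' == ch || '\r' == ch;

    // True if s has a valid JSON escape sequence starting at index.
    public static bool IsValidEscape(string s, int index)
    {
        if (index + 1 >= s.Length || '\\' != s[index]) return false;
        var c = s[index + 1];
        if ("\"\\/bfnrt".IndexOf(c) >= 0) return true;
        // otherwise it must be \u followed by exactly four hex digits
        if ('u' != c || index + 5 >= s.Length) return false;
        for (var i = index + 2; i < index + 6; ++i)
            if (!Uri.IsHexDigit(s[i])) return false;
        return true;
    }

    static void Main()
    {
        Console.WriteLine(IsValidEscape("\\n", 0));     // prints True
        Console.WriteLine(IsValidEscape("\\u00e9", 0)); // prints True
        Console.WriteLine(IsValidEscape("\\x41", 0));   // prints False
        Console.WriteLine(IsJsonWhiteSpace('\v'));      // prints False
    }
}
```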