Updated on 2019-03-31
A Non-Backtracking Regular Expression Engine for .NET (Core)
Why even make a regular expression engine when .NET includes a rich regular expressions library?
First of all, there are two major classes of regular expression engine - backtracking and non-backtracking. Most modern regular expression engines do backtracking. Backtracking engines have greater expressive power than non-backtracking engines, but at the cost of performance.
Why make a non-backtracking engine?
Well, for starters, they out-perform the backtracking engines on short strings over and over. Secondly, they are more suitable for operating directly on a streaming source, whereas backtracking regex engines generally require the text to be preloaded into memory.
More generally, the style of regular expression engine provided by .NET is geared toward matching patterns in long strings. It's not as suitable for lexing tokens out of a stream. Lexing is of primary importance to parsers, and so the regular expressions used for lexing terminals are often implemented using a non-backtracking engine.
Lexing is the process of breaking an input stream of characters into "tokens" - strings of characters that have a "symbol" associated with them. The symbol indicates what type of string it is. For example, the string "124.35" might be reported as the symbol "NUMBER" whereas the string "foo" might be reported as the symbol "IDENTIFIER".
Parsers typically use lexers underneath, and then compose the incoming symbols into a syntax tree. Because lexers are called in core parsing code, the lexing operations must be reasonably efficient. The .NET regular expression isn't really suitable here, and while it can work, it actually increases code complexity while diminishing performance.
Included in this project is a file called "FA.cs" that contains the core code for a regular expression engine using finite automata which resolves to the humble yet ubiquitous state machine.
Finite state machines are composed of graphs of states. Each state can reference another state using either an "input transition" or an "epsilon transition". An input transition consumes the specified input before moving to the destination. An epsilon transition moves to a destination state without consuming or examining any input. A machine with one or more epsilon transitions is considered "non-deterministic" (NFA) and may be in more than one state at any given time. Non-deterministic state machines are more complicated and less efficient, but fortunately, there is a deterministic state machine (DFA) for any non-deterministic state machine, and an algorithm to turn an NFA into a DFA using a technique called powerset construction - math that is out of the scope here to explore.
The closure of a state is the state itself and the unique set of all states reachable from that state, on either an epsilon or input transition. The closure represents the state machine indicated by the root state. These graphs can be recursive.
The graph for a state machine matching "foo" is just below:
The closure of q0 is { q0, q1, q2, q3 } because each of those states are connected to q0 either directly or indirectly.
Each time a black arrow is crossed, the input must match the character(s) above the arrow to proceed along that path. Once a double circle (q3) is encountered, the machine "accepts" the input as valid/matching and returns the token under the state name, in this case "foo" is under q3 so it is what will be returned once the input is matched.
Now let's match an identifier of the form [A-Z_a-z][0-9A-Z_a-z]*
As you can see, looping is fairly straightforward.
Now let's get useful. Lexers (aka tokenizers) distinguish between one of many different patterns of input. Let's build one that represents an identifier, integer or a space.
You'll note that there are two ways to match an int (q2, q4), but they each return the same symbol "int" as a result.
Each of the machines presented was deterministic (each is a DFA). A DFA machine will only ever be in one state at a time. There are also non-deterministic machines, or NFAs, which can be in several states at the same time!
You'll note this machine has gray dashed lines in it. These are epsilon transitions or transitions on epsilon. Epsilon simply means "no input" in this context. Every time a machine encounters a dashed arrow, it is automatically in the state pointed to by that arrow as well as itself. It can be said that it transitions on no input.
Starting out in q0 also puts you in q1, q4, q5, q7, q8 and q11 as well! This set of states, that is, the set of states reachable from this state on epsilon transitions is called the epsilon closure.
These NFA machines are easier to compose/build out of several other machines, but harder to run because they're more complex.
As mentioned, there's a DFA machine for every NFA machine, and an algorithm for converting an NFA to a DFA.
This code allows you to painlessly create these machines, to make pretty graphs like the above, (with the help of Graphviz), to generate super fast code to run a given state machine, to run a state machine at runtime without generating the code first, to translate regex expressions to machines and back again, and to otherwise examine and compose the machines.
Below is included in the sample project, and heavily commented to show you the basics of using this library:
//
// See the included project for a full sample
//
var lexer = FA.Lexer(
// our id from the article
FA.Parse(@"[A-Z_a-z][0-9A-Z_a-z]*", "id"),
// our int from the article
FA.Parse(@"0|([\-]?[1-9][0-9]*)", "int"),
// our space from the article
FA.Parse(@" ", "space")
);
// clean up any purely pass through states in our graph - not required
lexer.TrimNeutrals();
// transform the lexer into its DFA equivalent - not required
var dfaLexer = lexer.ToDfa();
// make sure there are no duplicate states left after the transformation.
// this can be an expensive op and isn't strictly necessary, so it's a
// separate step. It *should* be done before code generation.
// If code generation is done on an NFA, it is automatically transformed to
// a DFA and duplicates are removed during that process
dfaLexer.TrimDuplicates(); // not required - but recommended on any DFA
// prior to code generation
Console.WriteLine("Rendering graphs");
try
{
// try to render the graphs (requires GraphViz - specifically dot.exe)
// to change the format, simply use a different extension like PNG or SVG
// pass additional options using FA.DotGraphOptions if desired
lexer.RenderToFile(@"..\..\..\lexer_nfa.jpg");
dfaLexer.RenderToFile(@"..\..\..\lexer_dfa.jpg");
}
catch
{
Console.WriteLine("GraphViz is not installed.");
}
// generate the code (can generate from either an NFA or DFA)
// if using an NFA, it will first be transformed to a DFA
// and then any duplicates will be trimmed from the result
// generate code direct to C#
// we can use the CodeDOM here if CODEDOM compile constant is defined
// but to keep things simple, we won't here.
Console.WriteLine("Generating C# Code");
using (var tw = new StreamWriter(@"..\..\..\Generated.cs"))
{
tw.WriteLine("namespace {0}", typeof(Program).Namespace);
tw.WriteLine("{");
tw.WriteLine("partial class Program {");
// generate TryLexValue
dfaLexer.WriteCSharpTryLexMethodTo(tw, "Value");
tw.WriteLine();
// generate LexValue
dfaLexer.WriteCSharpLexMethodTo(tw, "Value");
tw.WriteLine();
tw.WriteLine("}");
tw.WriteLine("}");
}
// our test string
var test = "foo 456 bar 123 baz";
Console.WriteLine("Runtime Lexing:");
// Create a ParseContext over the string
// See https://www.codeproject.com/Articles/1280230/Easier-Hand-Rolled-Parsers
// ParseContext handles the input Cursor, the Position, Line and Column counting
// as well as error handing. It operates over an enumeration of characters, a string
// or a TextReader. The main members are Advance(), Current, Expecting(), Position
// Line, Column, Capture and CaptureCurrent()
// It's easy enough to add code to generate lexers that do not require this class
// but we would lose error handling and position tracking, so it's not implemented.
// to use other interfaces, wrap them with ParseContext.Create()
var pc = ParseContext.Create(test);
pc.EnsureStarted(); // make sure we're on the first character.
try
{
// ParseContext.Current works like a TextReader.Peek() does, returning an int
// without advancing the cursor. Unline Peek() it is not problematic.
while (-1 != pc.Current) // loop until end of input.
{
// runtime lexing (can be NFA or DFA - DFA is intrinsically more efficient)
// if we pass it a StringBuilder, it will reuse it - perf is better,
// but we don't bother here.
var tok = dfaLexer.Lex(pc); // lex the next token
// symbol is in tok.Key, value is in tok.Value
Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
}
}
catch(ExpectingException eex)
{
// We got a lexing error, so report it. The message might be long, but you can always
// build your own using ExpectingException members.
Console.WriteLine("Error: {0}", eex.Message);
}
Console.WriteLine();
Console.WriteLine("Compiled Lexing:");
pc = ParseContext.Create(test); // restart the cursor
pc.EnsureStarted();
try
{
// reusing a stringbuilder allows the lex routine to run a little faster
// don't use it for anything else if you're passing it to lex.
var sb = new StringBuilder(); // don't have to use this, can pass null
while (-1 != pc.Current) // loop until end of input.
{
// compiled lexing (always DFA - most efficient option)
// Generated.cs must be present and valid
var tok = LexValue(pc,sb); // lex the next token
Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
}
}
catch (ExpectingException eex)
{
Console.WriteLine("Error: {0}", eex.Message);
}
Let's take a look at the generated code now:
q1:
if((pc.Current>='0'&& pc.Current<='9')||
(pc.Current>='A'&& pc.Current<='Z')||
(pc.Current=='_')||
(pc.Current>='a'&& pc.Current<='z')) {
sb.Append((char)pc.Current);
pc.Advance();
goto q1;
}
return new System.Collections.Generic.KeyValuePair<System.String,string>("id",sb.ToString());
As you can see, this is state q1, which corresponds to the last part of our "id" token in the graph above.
You can see the output transitions from this state encoded into the if.
You can see the arrow pointing back to itself using the goto.
You can see that if the if condition isn't satisfied, it ends with a successful result that returns a token.
All of these properties are reflected in the graph above if you look for them.
Generating regular expressions - textual representation - from an FA is non-trivial. Parsing them from a textual representation is trivial. This is almost exactly the opposite of how syntax trees usually are in terms of coding them.
The generated DFAs are an order of magnitude faster in release builds than debug builds. I still don't know why. Regardless, they're still generally the fastest option in any build configuration.
The table based lexer can be faster than the pure generated lexer for large state machines, but it takes a bit more memory. The generated lexer is fast for smaller lexers. After more than 100 states it's a good idea to test the performance.
GraphViz Dot is deceptively complicated and expressive.
It took way too many iterations of this design over the years to get one I was happy with that handled general use cases well, including error handling. I finally got it fast, consistent, and easy in the right combinations.