Updated on 2024-01-15
In this article, we will use my Visual FA solution to explore finite automata concepts used for finding patterns in text.
This is yet another attempt at Visual FA. My first attempt suffered from performance problems that were hidden by flaws in the benchmark code. To fix the issue required a total rewrite, everything from the runtime logic to the generator and compiler. Let's take a moment to feel sorry for me. =) Or shame me, as is your preference. Whatever works.
In any case, there is a lot of ground to cover so I'm breaking this into a multipart series and starting from the beginning, even if it means we're covering territory in previous articles.
We'll be using Visual FA to do everything here, so all of the article's content is with an eye toward this engine. When comparing other engines, such as Microsoft's that will be noted.
Article list
In short, Visual FA is basically a regex engine. That's simplifying it a lot and it will probably just lead to more questions. The long answer is it is a finite automata simulator, code generator and compiler for matching patterns in Unicode streams.
True, Microsoft offers a snazzy backtracking engine that can compile and generate code. What it can't do is tokenize text, at least not without some serious compromises, hacks, and performance penalties. It also trades away speed for features. What sets Visual FA apart is that it is a minimalist engine with no backtracking, anchors, or any other fluff that may get in the way of raw performance, plus it has lexing/tokenizing capabilities.
So primarily, if you need to lex/tokenize, you'll need to use an engine other than Microsoft's. Visual FA fits the bill. Secondarily, if you don't need fancy features, and you want a bit more performance, this library is also a viable candidate, being up to 3 times faster in some cases.
And finally, Microsoft's engine can't show you all of its guts, and generate pretty graphs. All of the images in this article were generated using Visual FA, with the help of Graphviz.
This article is primarily an exploration of concepts peppered with a quick start guide on using Visual FA to do fundamental operations.
You don't need to be a regex expert to enjoy this article, but having some fundamentals down is really helpful. Visual FA's syntax is minimalist and relatively simple. Furthermore, we won't cover everything Visual FA supports here, because it's not entirely necessary.
Regular Operators:
ABC
- matches "ABC"|
: ABC|DEF
- matches either "ABC" or "DEF"[A-C]
- matches either "A", "B" or "C"[^A-C]
- matches any character other than "A", "B" or "C"?
: A?
- matches either "A" or ""*
: A*
matches "A" or "" or "AAAAAAAA"+
: A+
matches "A" or "AAAAAAAA"{,}
: A{2,}
matches "AA" or "AAA" or "AAAAAAA"()
- precedence overriding (aka (?:)
): (ABC)?
- Matches "ABC". Note that Visual FA does not sub-capture.With Visual FA, you can parse an expression into an Abstract Syntax Tree (AST) or directly into a state machine:
using VisualFA;
// parse into a state machine directly
FA fa = FA.Parse("foo|bar");
// parse into an AST for visitation
// and later transformation into a state
// machine using .ToFA()
RegexExpression exp = RegexExpression.Parse("foo|bar");
fa = exp.ToFA();
Actually, that's a bit of an overstatement of scope - we'll be looking at two specific kinds of finite automata used for matching patterns in a flat stream (series of elements, but no hierarchies of elements): NFAs and DFAs.
These stand for Non-deterministic Finite Automata and Deterministic Finite Automata respectively, and they are two kinds of finite state machine.
The terminology is playing a little loose there, as these two types are more specific types of state machines than their names would otherwise imply. All finite automata machines - that is, all state machines are either deterministic or non-deterministic, even including push down automata and other types of state machines we won't be exploring here. That said, when you see NFA and DFA in the mathematical sense, you can usually assume we mean the two types of machines we'll be covering here.
If instead, you see NFA and DFA being used to describe a category of regular expression engine, that's a little bit different, although just to be difficult, they come from the same origins. Visual FA can handily deal with NFA state machines, but its regular expression engine is strictly the DFA type. Microsoft's Regex engine is by contrast, an NFA type. DFA regular expression engines typically build NFA machines out of expressions and then transform those into DFA machines to do the actual matching. Visual FA is no exception. It does however, complicate things because it can run a match using an NFA machine without first converting it to a DFA. Even in this case though, it is essentially acting on that machine as though it will be turned into a DFA. Microsoft's engine by contrast, has things on their NFA states that cannot be transformed into a DFA. Visual FA does not. In effect, Visual FA's NFAs are just future DFAs in waiting. If you don't understand all of this yet that's fine, but maybe revisit it later if you start to question the terminology used as you read further.
The set of all states reachable from a state directly or indirectly via transitions - including itself, is known as the closure of a state.
This is important because in Visual FA, your "handle" for the entire state machine is always the root state (known as q0 canonically but you can call it what you like). The closure of q0 is always the entire machine. Review this material after you look through the very next section, because it has pretty pictures, and they really help here.
Essentially, you can build or parse a state machine into existence, and what you get from one of those builder or parse functions is the root of what it built:
FA myMachine = FA.Parse("foo|bar");
Here, myMachine is actually q0, which we'll see below, under the very next section.
myMachine has a total of 6 states, including itself.
List<FA> closure = new List<FA>();
// fill the above list with 6 states
// first will be q0.
myMachine.FillClosure(closure);
// prints "myMachine has 6 states."
Console.WriteLine("myMachine has {0} states.",myMachine.Count);
Debug.Assert(myMachine==closure[0]); // always true
That's kind of longhand way to perform a closure. There's a shorter way to call it:
IList<FA> closure = myMachine.FillClosure();
The first form allows you to recycle list instances which is important for doing set heavy operations. That's why Visual FA uses FillXXXX() methods instead of always returning new lists. FillClosure() is expected to be called frequently, so it is orchestrated accordingly.
We'll start with DFAs since they are the simplest type of machine to understand. All finite state machines are directed graphs of states with transitions to other states on particular inputs:
This is a DFA state machine matching the text "foo" or the text "bar" (foo|bar).
We always start at q0.
We can only move in the direction indicated by the arrows.
On each state q, we examine the current codepoint* under the cursor, and the list of outgoing arrows. If there is an arrow with the input codepoint on it, we must follow that arrow and advance the cursor by moving to the next codepoint.
If there are no matching outgoing transitions from 3, then we must evaluate the current state that we're on now. If it's a double circled state, that means we can return the input we've gathered so far as a match. If it's not a double circled state, then there was no match.
codepoint indicates the UTF-32 Unicode codepoint under the cursor that we're examining. For most situations, this is effectively your current character. The exception is in the case of UTF-16 surrogate characters, which will be resolved to a single UTF-32 codepoint for evaluation in the machine. We don't deal with surrogates at this level, so there's nothing to complicate it.
You can use the FSM Explorer application included with this solution to produce and visually step through a regular expression with some input text.
Typically, every DFA starts out life as an NFA*. The reason is that NFAs are far easier to construct programmatically through parsing or composition of several machines (we'll cover that). We don't keep them as NFAs because NFA graphs are more complicated than DFA graphs for a computer to traverse.
The primary difference between an NFA and a DFA is that an NFA adds the ability to be in more than one state at once.
This can happen in one of two ways:
Here is effectively the same machine as the above, but represented as an expanded NFA.
The gray dashed arrows are epsilon transitions as explained in #2 from the above list. Note that you always follow every epsilon transition you encounter immediately. That means as soon as you are in q0, you are also in q1 and q6 as well.
Here are the rules for walking NFAs. They are nearly the same as the DFA rules, just accounting for epsilon transitions and multiple states:
A state in a state machine is represented by the VisualFA.FA instance, regardless of whether the machine is a DFA or an NFA.
Let's revisit this machine in code:
using VisualFA;
// some of this is recap from above
// we called this myMachine above:
// difference now is we're parsing it
// into expanded form so we can have epsilon transitions
FA q0 = FA.Parse("foo|bar",0,false);
// collect all states reachable from q0
// on any transition including itself.
// q0 will be the first entry in the list
IList<FA> closure = q0.FillClosure();
// closure.Count is 10
// get the epsilon closure. Note that we typically
// do this on a set of states, except in the case of q0
// on our initial navigation. For q0 we need to make an
// array out of it.
IList<FA> epsilonClosure = FA.FillEpsilonClosure(new FA[] { q0 });
// epsilonClosure is { q0, q1, q6 }
// There is no instance overload for a single state
// for this operation because it makes conflicts
// in terms of determining whether to call the instance
// or static methods
Thompson's construction algorithm is a well known (among FA nerds, that is) algorithm for converting regular expression operators into NFAs. The idea is that each operation has a well known construction for building it, and then you compose those together like Legos to build your final expression.
Thompson's construction algorithm
Let's start with arguably, the simplest - a literal - ABC: (This is technically 3 literals concatenated in regex parlance, but it's nice for reasons to treat a literal string as a first class entity.)
FA ABC = FA.Literal("ABC");
And for our second more or less "atomic" construct, a set [ABC]:
FA AthroughC = FA.Set(new FARange[] {new FARange('A','C')});
Now for simple concatenation, ABC(DEF):
FA ABC = FA.Literal("ABC");
FA DEF = FA.Literal("DEF");
FA ABC_DEF = FA.Concat(new FA[] { ABC, DEF });
Next, a little alternation ABC|DEF:
FA ABC = FA.Literal("ABC");
FA DEF = FA.Literal("DEF");
FA ABCorDEF = FA.Or(new FA[] { ABC, DEF });
Next, an optional match (ABC)?:
FA ABC = FA.Literal("ABC");
FA ABCopt = FA.Optional(ABC);
Finally some repetition/looping: (ABC)+:
FA ABC = FA.Literal("ABC");
FA ABCloop = FA.Repeat(ABC,1,0);
We can compose everything by treating every expression as its own machine, and then composing those machines together. The above graphs are fundamental machines that all other machines can be built out of. Such is the advantage of using NFAs at this stage. DFAs cannot be readily composed.
Speaking of which:
Powerset construction, or subset construction depending on who you ask is basically the process of creating a new DFA that matches the same input as a given NFA. It has been proven mathematically that this is possible for any NFA that follows the rules in this article. The trick is finding it. The primary idea is each and every combination of states you can possibly be in in an NFA represents a single state in the new DFA, with its transitions merged into that single state.
This doesn't always work, because with an NFA, you can transition to more than one state on any given input. What the powerset construction does to account for that, is bubble the conflict further down the path - such that it tries to disambiguate further down the line.
Here's a pair of figures that we'll go over. This is a machine that determines whether the input is a line comment or the beginning of a block comment:
The top machine is an NFA in compact form. The bottom machine is the equivalent DFA (not minimized - q4 could potentially be eliminated and rolled into q3).
The bottom graph has { } sets below each state label. These sets are the set of states from the NFA that yielded the state in the DFA. For example, q0 in the DFA was simply q0 in the NFA. However, transitioning from q0 on / in the NFA brings us to two different states, q1 and q3 both. q1 in our DFA reflects that.
Notice how the DFA transformation bubbled the disjunction/divergence on / further down the path, from q0 to q1 in the DFA on * vs. / instead. That's what I was trying to get at earlier about dealing with ambiguities. The construction algorithm assumes that if you traverse far enough eventually, there will be some kind of differentiating input. When that's not the case, one expression blindly pushes the other aside and effectively shadows it.
// machines start out in compact NFA form by default
// either through parsing or thompson construction
FA nfa = FA.Parse(@"(\/\/[^\r\n]*)|(\/\*)");
// convert to a DFA (no minimization)
FA dfa = nfa.ToDfa();
// or (slower, makes more efficient machines)
FA min_dfa = nfa.ToMinimizedDfa();
// minimization removes duplicate states
// in the above machine q4 would be removed.
I've mentioned lexing and tokenization more than once. To tokenize a document means to take the text therein and break it into a series of tokens, each token representing a position, the value that was matched, and a symbolic matched expression (or an error symbol if no expression was matched). This is often used by parsers to parse structured text.
Since Microsoft's parser can't do this, you may not be familiar with that usage for regular expressions, but it is as bread and butter for regex as matching text in strings is.
In order to work, your state machine needs to be composed of multiple expressions. Each expression has an associated symbol it is tagged with. Visual FA calls this the accept symbol. As you retrieve matches over inputs, a match is reported with the associated accept symbol for the expression that matched the text at the reported position. Every portion of the document is returned, including sections that weren't matched. They will be reported with an accept symbol of -1.
All of this engine's searching facilities report all of the text - even unmatched portions. That makes it significantly different than Microsoft's offering, and usable in more scenarios. If you want it to work like Microsoft's, you simply need to ignore FAMatch instances whose IsSuccess property is false or whose SymbolId property is less than zero.
Here's an example of what a lexer might look like (presented in DFA form):
Note the special dashed transition outgoing from q10. This is not an epsilon transition - those are gray and have no label on them - this is a lazy match simulation transition to a "block end". We'll cover block ends and the need for them next.
DFA machines cannot lazy match. Well, there is one person that claims to have figured it out, and has provided an implementation, but I fear it explodes the state table size if I understand it properly, and I'm not sure that I do. In any case, DFA machines always consume input until they have nowhere left they can move, even if a shorter string would have yielded a valid match. This is called greedy matching.
has provided an implementation
Sometimes it's not what you need. In particular, those times tend to be cases where you need to match multicharacter ending conditions to arbitrary text, like finding the end of HTML/SGML/XML comments and CDATA sections, or finding the end of C style block /* */ comments as we've done. To do so requires lazy matching. It requires telling the engine to stop being so greedy! However, we cannot communicate that to the engine without breaking the math of the NFA to DFA transformation.
Here's what we do instead: As alluded to before, each expression has a non-negative integer accept symbol associated with it.
Each accept symbol meanwhile potentially has another expression/state machine associated with it known as a "block end". If it has one, then when the initial expression is matched, the matcher will keep consuming input up until and including the point where the block end expression is matched.
Despite only allowing one per expression, and only allowing for accepting any and all characters through the final match, this still covers probably 80% of the use case for lazy matching without requiring a fundamental change in how any of the mechanics of the engine work.
With Visual FA, you specify your block ends at the time of execution for the runtime engine, at compile time when using the compiled matchers, and at generation time when using generated matchers. They are always specified as an array where the index into the array corresponds to the accept symbol associated with that block end. The array can be null, and does not need to be the entire size necessary to hold the maximum accept symbol - only the maximum accept symbol with a block end attached to it.
// accept sym zero is default but specified here for clarity
FA commentBlock = FA.Parse(@"\/\*", 0); // accept sym of zero
FA commentBlockEnd = FA.Parse(@"\*\/");
foreach(FAMatch match in commentBlock.Run("foo /*bar*/ baz",
new FA[] { commentBlockEnd}))
{
Console.WriteLine("{0}:{1} at {2}",
match.SymbolId,
match.Value,
match.Position);
}
This will print the following to the console:
-1:foo at 0
0:/*bar*/ at 4
-1: baz at 11
As mentioned, the runners will always return the entire text. If you only want the bits that matched, check each FAMatch instance for IsSuccess.
Let's walk through the execution of a DFA as though it was C# code:
int cp; // the current codepoint under the cursor
int len; // the current match capture length
int p; // the current position
int l; // the current line
int c; // the current column
cp = -1;
len = 0;
if ((this.position == -1)) {
// first move
this.position = 0;
}
// mark the current location
p = this.position;
l = this.line;
c = this.column;
// load cp with the current available
// codepoint, but do not capture
this.Advance(s, ref cp, ref len, true);
q0:
// if it's a '/'
if ((cp == 47)) {
// move and capture
this.Advance(s, ref cp, ref len, false);
goto q1;
}
// not accepting
goto errorout;
q1:
// if it's a '*'
if ((cp == 42)) {
this.Advance(s, ref cp, ref len, false);
goto q2;
}
// not accepting
goto errorout;
q2:
// accepted /* - now process block end
// for symbol zero
return _BlockEnd0(s, cp, len, p, l, c);
errorout:
// error handling. keep capturing the bad data until
// it looks like we might get good data.
if (((cp == -1) // eof?
|| (cp == 47))) { // q0 transitions
if ((len == 0)) { // didn't capture, so end of input
return FAMatch.Create(-2, null, 0, 0, 0);
}
return FAMatch.Create(-1,
s.Slice(p, len).ToString(),
p,
l,
c);
}
this.Advance(s, ref cp, ref len, false);
goto errorout;
The basic idea is to always have a current input you're looking at. This might require a corner case on the first input depending on what your source input API looks like. Once you do:
In the block end, things are slightly simpler:
q0:
if ((cp == 42)) { // is it a '*'?
// capture and then advance the input
this.Advance(s, ref cp, ref len, false);
goto q1;
}
// not accepting
goto errorout;
q1:
if ((cp == 47)) { // is it a '/'?
this.Advance(s, ref cp, ref len, false);
goto q2;
}
// not accepting
goto errorout;
q2:
// accepting
// since this is a block end, we return everything
// we captured since we were called.
return FAMatch.Create(0,
s.Slice(position, len).ToString(),
position,
line,
column);
errorout:
// if there's no more input, return what we have
// as an error
if ((cp == -1)) {
return FAMatch.Create(-1,
s.Slice(position, len).ToString(),
position,
line,
column);
}
this.Advance(s, ref cp, ref len, false);
goto q0;
Next, we will cover the API mechanics in more detail, as well as how that works, followed by compilation and code generation. You can find it here.
Enjoy.