Updated on 2019-08-17
Using PCK's grammar system, and understanding the concepts behind it
This is one of a series of articles on The Parser Construction Kit (PCK). I just call it "Puck".
The Parser Construction Kit (PCK)
Disclaimer: As always, please use GitHub for the code bits, as I'm building on this daily. I always provide the latest-at-the-time build of PCK with every article, but I make significant improvements in between articles. So again, please use the GitHub. I only post the bits here directly to monitor interest and as tradition. I don't go back and update these code bits here at Code Project with fixes and improvements.
This article covers the pck.cfg assembly, but it also covers the concepts behind it, which are much more important.
It's very helpful to have used a parser generator before, especially PCK. Any background you have on parsing can only help here, but I'll try to keep things simple.
First of all, let's rehash some content from an earlier article of mine, since it's vital here:
Regardless of underlying parsing algorithm, in order for any parser to work, it must have some kind of grammar. Almost everything we produce for parsing will be generated from this grammar. The grammar describes the "language" we are looking to parse. We're only looking to parse context-free languages so far, so we will need a context-free grammar, or CFG.
PCK handles basic operations in CFGs but to understand them, we'll have to explore how this all works conceptually.
Let's start with a rundown of the parsing process. We'll explore this small C fragment.
int main() {
    printf("Hello World");
    return 0;
}
CFGs are the point here, but grammars themselves typically don't understand individual characters in the input stream. They work with tokens, which are small text fragments - lexemes, essentially - with an associated symbol/label attached to them.
The above might be broken down into these tokens:
keyword: int
(whitespace)
identifier: main
lparen: (
rparen: )
(whitespace)
lbrace: {
(whitespace)
identifier: printf
lparen: (
string: "Hello World"
rparen: )
semi: ;
(whitespace)
keyword: return
(whitespace)
int: 0
semi: ;
(whitespace)
rbrace: }
Note that on each line, the symbol on the left-hand side of the colon is "attached" to the fragment of text on the right-hand side. Note also that the whitespace is essentially passed over.
Anyway, the grammar won't care what the value is. It only cares about the left hand side parts - the symbols. This is what the grammar sees:
keyword,identifier,lparen,rparen,lbrace,identifier,
lparen,string,rparen,semi,keyword,int,semi,rbrace
These are called terminal symbols, or terminals. They are the leaves of our parse tree. We create these from input text, but that's a separate process, covered by Pck/FA; CFGs don't care about that part. Non-terminals are the nodes in the tree, and they're of central importance to a CFG. We use them to impose a hierarchy on these lists of terminals we get from the input.
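The tokenization step above can be sketched in a few lines. This is purely illustrative Python, not PCK's tokenizer (Pck/FA generates real ones); the rule names and patterns are assumptions taken from the breakdown above.

```python
import re

# Illustrative toy tokenizer - NOT PCK code. Rules are tried in order;
# whitespace is passed over, just as in the breakdown above.
TOKEN_RULES = [
    ("keyword",    r"\b(?:int|return)\b"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("string",     r"\"[^\"]*\""),
    ("int",        r"[0-9]+"),
    ("lparen",     r"\("), ("rparen", r"\)"),
    ("lbrace",     r"\{"), ("rbrace", r"\}"),
    ("semi",       r";"),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():          # skip whitespace
            pos += 1
            continue
        for symbol, pattern in TOKEN_RULES:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((symbol, m.group()))
                pos += len(m.group())
                break
        else:
            tokens.append(("#ERROR", text[pos]))  # unrecognized input
            pos += 1
    return tokens

src = 'int main() { printf("Hello World"); return 0; }'
print([sym for sym, _ in tokenize(src)])
```

Running this prints exactly the symbol sequence the grammar sees, as listed above.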
At its simplest, a CFG is a series of rules. The rules themselves are composed of elements called symbols which can either be "terminal" or "non-terminal". We'll be using lowercase to represent terminals and uppercase to represent non-terminals but in practice it doesn't matter. This is simply for clarity.
A rule takes the form of:
A -> B c
Here, A is the left hand side, and B c are two symbols (non-terminal and terminal respectively) appearing on the right hand side. A symbol appearing on the left-hand side of any rule is automatically a non-terminal. Since all of ours are uppercase, only uppercase words will appear on the left hand side of any rule.
We can have multiple rules with the same left hand side.
A -> b
A -> c
This is sometimes written as the shorthand:
A -> b | c
While the left hand side is called the non-terminal, the right hand side is called a derivation of the rule.
When we have multiple derivations with the same left hand non-terminal, it means the rule can be resolved in multiple ways. It's kind of like a choice. Above might be spoken as "A is derived as the terminal b or c."
When we have multiple symbols appearing on the right hand side/derivation as in the first example, that represents a sequence of symbols, such as "A is derived as the non-terminal B followed by the terminal c".
A rule can be empty, or nil, which means it has no right hand side such as:
A ->
This is useful for combining with other A rules in order to make them optional, such as:
A -> c
A ->
Which indicates that A may be derived as the terminal c or nothing.
Rules can be recursive, such that the left hand side is referenced within its derivation on the right hand side. This creates a kind of "loop" in the grammar, which is useful for describing repeating constructs, such as argument lists in function calls.
A -> c A
The only issue with this is that one family of parsing algorithms - LL, including LL(1) - is limited such that a rule cannot be left recursive or it will create an infinite loop, and so the following is not allowed:
A -> A c
There are workarounds for this, which you must apply in order for the grammar to be valid for certain parsers, like LL parsers. Massaging the grammar to work with a parser is known as factoring. There are even ways to do it programmatically, which are beyond the scope of this article. However, even with factoring and left recursion elimination, not all grammars can be parsed with all algorithms.
The series of rules taken together makes up the grammar, and defines an overarching outline/hierarchy for our language.
The structure and form of the non-terminals are defined by the rules. The terminals are not yet defined - only referenced in the rules. Defining the terminals is covered by Pck/FA, but an overview of the concept is covered in a former article of mine here.
The grammar need only refer to symbols by their handle. It needs no other direct information about them other than what is provided by the rules. In our code, we will simply be using strings to represent the symbols, but they could be integers, or GUIDs, or whatever your heart desires, as long as the values act as unique handles.
That leads to a pretty simple data structure for a CFG overall.
A rule has a string field for the left hand side, and a list of zero or more strings for the right hand side/derivation.
A grammar meanwhile, is just a list of these rules.
The clever bit comes in what we do with them. Roughly, there are five major steps most parsing algorithms need to perform.
The first thing is to know how to determine which symbols are non-terminals and which are terminals even though they're all just strings. That's pretty easy. Anything that appears on the left hand side of any rule is a non-terminal. All the rest of the symbols referenced by the rules are terminal. Remember that while non-terminals are defined by these rules, the terminals are defined elsewhere. Right now, they're "abstract".
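This first step is trivially mechanized. Here's a minimal sketch (plain Python, not Pck/Cfg's classes; the grammar is the abstract A/B example from earlier): a rule is just a left-hand string plus a list of right-hand strings, and classification falls out of set operations.

```python
# Illustrative sketch: a rule is (left, [right...]); a grammar is a list of rules.
rules = [
    ("A", ["B", "c"]),  # A -> B c
    ("B", ["d"]),       # B -> d
    ("B", []),          # B ->   (empty/nil)
]

# Anything appearing on a left-hand side is a non-terminal...
nonterminals = {left for left, _ in rules}
# ...and every other symbol referenced by the rules is a terminal.
terminals = {sym for _, right in rules for sym in right} - nonterminals

print(sorted(nonterminals))  # ['A', 'B']
print(sorted(terminals))     # ['c', 'd']
```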
The second thing they need to do is conceptually "augment" the grammar by adding two reserved terminals, "#EOS" (often represented as "$" in other tutorials) which is the end of input stream marker, and "#ERROR" which indicates an unrecognized bit of input in the stream. These have no user supplied definition. They are produced while reading the input as necessary and consumed by the parser.
The third thing they need to do is be able to compute the PREDICT/FIRSTS sets. FIRSTS are the terminals that can appear at the beginning of each non-terminal. For any rule, look at the first symbol of its right hand side (or empty/nil) - that's what we need. However, if that symbol is a non-terminal, we need its FIRSTS in turn, and so on. In the end, each non-terminal should have an exhaustive list of the terminals that can appear first in its derivations, plus nil if there is an empty/nil rule as shown previously.
The PREDICT sets are the same as the FIRSTS sets except they also have the rule associated with each terminal so you know where it originated. In practice, since the PREDICT sets contain all the information the FIRSTS sets contain, it's best to forgo the computation of the FIRSTS sets directly, and simply compute the PREDICT sets.
The fourth thing they need to do is compute the FOLLOWS sets. The FOLLOWS sets are the terminals that can appear directly after a given non-terminal. This is trickier than computing FIRSTS and PREDICT sets. You have to scan the grammar looking for other rules that reference your non-terminals in order to find what can follow them. We'll be getting into that below with our example.
Finally, the fifth thing they need to do is create the parse table. Every CFG-based parsing algorithm I've ever encountered uses the information from these sets to do so. They may use the information in radically different ways, but they all need these operations.
In this lesson, we will be declaring the classes for a CFG and performing four of these five steps.
Taking what we've outlined above, let's work through it conceptually with a specific example. We will be using the same example as Andrew Begel's excellent work so that can be referred to as well.
Consider the following grammar:
E -> E + T | T
T -> T * F | F
F -> ( E ) | int
or its longhand equivalent:
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> int
This represents a grammar for a simple arithmetic language that supports *, + and ( ) operations.
The non-terminals are E, T, and F.
The terminals are +, *, (, ) and int.
The grammar is also left-recursive, which as mentioned earlier, will not do. We're going to manually refactor it and eliminate the left recursion.
Take a look at the first two rules (shorthand):
E -> E + T | T
Here, for each rule where the non-terminal on the left of the arrow (E) is the same as the first symbol on the right-hand side of the arrow (E + T), we take the part of the rule without the leading E (+ T), remove it, and add it as its own new rule, which we'll call E':
E' -> + T
Now add E' to the end of each of the new rules:
E' -> + T E'
Now add an extra rule that is empty/nil:
E' -> + T E'
E' ->
Now we must fix the original E rules. To do so, we take all of the right-hand sides that didn't start with E, and add E' to the end of them:
E -> T E'
If we do the same with T, we get the following grammar:
E -> T E'
E' -> + T E' |
T -> F T'
T' -> * F T' |
F -> ( E ) | int
or longhand:
E -> T E'
E' -> + T E'
E' ->
T -> F T'
T' -> * F T'
T' ->
F -> ( E )
F -> int
Note that we didn't do anything to F because it wasn't left-recursive.
Above is the modified grammar we'll be using and referring to so keep that in mind.
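The manual procedure we just walked through can be mechanized. The following is a sketch in plain Python (an assumption on my part - PCK has its own factoring routines; this is just the immediate left-recursion elimination steps above, written out as code):

```python
# Sketch of immediate left-recursion elimination - NOT PCK's factoring code.
def eliminate_left_recursion(rules):
    out = []
    for nt in dict.fromkeys(left for left, _ in rules):  # each LHS, in order
        recursive = [r[1:] for l, r in rules if l == nt and r[:1] == [nt]]
        other     = [r     for l, r in rules if l == nt and r[:1] != [nt]]
        if not recursive:                    # e.g. F: nothing to do
            out += [(nt, r) for r in other]
            continue
        prime = nt + "'"                     # the new non-terminal, e.g. E'
        out += [(nt, r + [prime]) for r in other]         # E -> T E'
        out += [(prime, r + [prime]) for r in recursive]  # E' -> + T E'
        out.append((prime, []))                           # E' ->   (nil)
    return out

rules = [
    ("E", ["E", "+", "T"]), ("E", ["T"]),
    ("T", ["T", "*", "F"]), ("T", ["F"]),
    ("F", ["(", "E", ")"]), ("F", ["int"]),
]
for left, right in eliminate_left_recursion(rules):
    print(left, "->", " ".join(right))
```

Running this prints exactly the longhand refactored grammar listed above.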
Now we can move on to computing the FIRSTS sets (or more generally, the PREDICT sets, but we'll be looking at FIRSTS in particular).
As before, FIRSTS are the first terminals that can appear for any non-terminal, while a PREDICT is the same but with the rule that each of the FIRSTS came from.
Let's start with an easy one, FIRSTS(F) and PREDICT(F) - the firsts and predict tables for F.
We see F -> ( E ), so we can say that ( is one of the FIRSTS for F, and one of the PREDICT entries is ( F -> ( E ), ( )
We also see F -> int, so we can say that int is one of the FIRSTS for F, and one of the PREDICT entries is ( F -> int, int )
We're done with F, which has two entries. From now on, we'll only list the FIRSTS sets for readability. The PREDICT sets can be implied from the originating rule like above.
FIRSTS(E) = { (, int }
FIRSTS(E') = { +, nil }
FIRSTS(T) = { (, int }
FIRSTS(T') = { *, nil }
and finally, the one we already did:
FIRSTS(F) = { (, int }
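The FIRSTS computation above can be sketched as a fixed-point iteration. This is plain Python, not Pck/Cfg's implementation; None stands in for nil.

```python
# Sketch of the FIRSTS computation for the refactored grammar - NOT PCK code.
rules = [
    ("E",  ["T", "E'"]),
    ("E'", ["+", "T", "E'"]), ("E'", []),
    ("T",  ["F", "T'"]),
    ("T'", ["*", "F", "T'"]), ("T'", []),
    ("F",  ["(", "E", ")"]),  ("F", ["int"]),
]
nonterminals = {left for left, _ in rules}

def firsts(grammar):
    f = {nt: set() for nt in nonterminals}
    changed = True
    while changed:                        # repeat until nothing new is added
        changed = False
        for left, right in grammar:
            before = len(f[left])
            for sym in right:
                if sym not in nonterminals:
                    f[left].add(sym)      # a terminal: it starts this rule
                    break
                f[left] |= f[sym] - {None}  # pull in the non-terminal's FIRSTS
                if None not in f[sym]:
                    break                 # it can't derive nil, so stop here
            else:
                f[left].add(None)         # empty rule (or all symbols nullable)
            changed |= len(f[left]) != before
    return f

for nt, fs in sorted(firsts(rules).items()):
    print(nt, sorted(fs, key=str))
```

This prints sets matching those above; adding the originating rule alongside each terminal would turn the same computation into the PREDICT sets.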
Good. Now moving on to the FOLLOWS sets:
FOLLOWS() tells us the terminals that can come directly after a given non-terminal - not the final terminal within the non-terminal itself, but the ones that can appear after it. It's dicey to compute because you have to look at all the rules where the non-terminal can appear - that is, examine all of the derivations that contain it - in order to figure out what can follow it, and even then it's not quite that simple. There is a non-obvious case having to do with nil rules.
So now we find every place our non-terminal is located on the right side of any of the arrows, looking for anything that follows it. When we find a non-terminal following it, we take its FIRSTS and use those. When we find a terminal following it, we use that directly. We also have to handle nil/empty rules specially, as well as non-terminals that appear at the rightmost position of a derivation.
Before we begin, we must virtually augment the grammar with a special starting rule that includes the #EOS (end of stream) terminal after our start symbol. Here, this rule would be:
S -> E #EOS
This is important so that we can tell when we've reached the end of our parse.
Now that we've augmented the grammar for the purposes of computing the FOLLOWS, we can move on.
Let's work out FOLLOWS(E):
Obviously, #EOS is one based on the augmented rule above. Look through all the rules in the grammar and see if E appears anywhere else. It does. It appears under one of the F rules. What follows it is the terminal ), so we add that. We're done with E.
Now let's do FOLLOWS(E'):
This one is trickier. Nothing follows it directly in the grammar. It appears at the end of a couple of rules though, and remember we need to deal with that. For this, we need to examine the rules where it was referenced:
E -> T E'
We can see that it came from E, so if you think about it, what follows E also follows E'.
E' -> + T E'
For this one, it's the same thing, except as Andrew Begel points out, it's a tautology: What follows E' follows E' so we ignore it.
On to FOLLOWS(T):
Easy enough. T is followed by E'. Ergo, whichever terminals begin E' must be the terminals that follow T.
In other words, FOLLOWS(T) = FIRSTS(E')
But wait, there's a nil in there. Remember when I said nil rules need special handling? Well, here we are.
We can't keep that nil from the FIRSTS set. Since it came from FIRSTS(E'), whatever is in FOLLOWS(E') also follows T, because of our optional rule - E' can derive nothing, so whatever follows it can appear there instead. Trust me, if you do the math, it works out this way: wherever a nil came from FIRSTS(X), substitute FOLLOWS(X).
Applying those principles, let's do them all:
FOLLOWS(E) = { #EOS, ) }
FOLLOWS(E') = { #EOS, ) }
FOLLOWS(T) = { +, #EOS, ) }
FOLLOWS(T') = { +, #EOS, ) }
FOLLOWS(F) = { *, +, #EOS, ) }
We're done with that.
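The FOLLOWS computation can also be sketched in plain Python (again, not PCK's code), using the grammar augmented with the virtual rule S -> E #EOS described above:

```python
# Sketch of the FOLLOWS computation - NOT PCK code. None stands in for nil.
rules = [
    ("S",  ["E", "#EOS"]),                # virtual augmented start rule
    ("E",  ["T", "E'"]),
    ("E'", ["+", "T", "E'"]), ("E'", []),
    ("T",  ["F", "T'"]),
    ("T'", ["*", "F", "T'"]), ("T'", []),
    ("F",  ["(", "E", ")"]),  ("F", ["int"]),
]
nonterminals = {left for left, _ in rules}

def firsts(grammar):                      # FIRSTS, computed as before
    f = {nt: set() for nt in nonterminals}
    changed = True
    while changed:
        changed = False
        for left, right in grammar:
            before = len(f[left])
            for sym in right:
                if sym not in nonterminals:
                    f[left].add(sym); break
                f[left] |= f[sym] - {None}
                if None not in f[sym]:
                    break
            else:
                f[left].add(None)
            changed |= len(f[left]) != before
    return f

def follows(grammar, fi):
    fo = {nt: set() for nt in nonterminals}
    changed = True
    while changed:                        # repeat until nothing new is added
        changed = False
        for left, right in grammar:
            for i, sym in enumerate(right):
                if sym not in nonterminals:
                    continue
                before = len(fo[sym])
                tail_nullable = True
                for nxt in right[i + 1:]:
                    if nxt not in nonterminals:
                        fo[sym].add(nxt)            # a terminal follows directly
                        tail_nullable = False; break
                    fo[sym] |= fi[nxt] - {None}     # FIRSTS of what comes next
                    if None not in fi[nxt]:
                        tail_nullable = False; break
                if tail_nullable:
                    fo[sym] |= fo[left]   # what follows the left side follows sym
                changed |= len(fo[sym]) != before
    return fo

fo = follows(rules, firsts(rules))
for nt in ["E", "E'", "T", "T'", "F"]:
    print(nt, sorted(fo[nt]))
```

The printed sets match the FOLLOWS results listed above.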
As mentioned, we need these FIRSTS/PREDICT and FOLLOWS results for most parsing algorithms in order to generate a parse table.
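As a final sketch (a textbook LL(1) construction in plain Python, not PCK's internals), here's how the PREDICT and FOLLOWS results combine into a parse table for our example grammar. The sets are copied from the worked example above:

```python
# Sketch of LL(1) parse table construction - NOT PCK code.
predict = {   # rule -> the terminals predicting it; None marks the nil rules
    ("E",  ("T", "E'")):      {"(", "int"},
    ("E'", ("+", "T", "E'")): {"+"},
    ("E'", ()):               {None},
    ("T",  ("F", "T'")):      {"(", "int"},
    ("T'", ("*", "F", "T'")): {"*"},
    ("T'", ()):               {None},
    ("F",  ("(", "E", ")")):  {"("},
    ("F",  ("int",)):         {"int"},
}
follows = {   # taken from the worked example above
    "E":  {"#EOS", ")"},      "E'": {"#EOS", ")"},
    "T":  {"+", "#EOS", ")"}, "T'": {"+", "#EOS", ")"},
    "F":  {"*", "+", "#EOS", ")"},
}

# table[(non-terminal, lookahead terminal)] -> right-hand side to apply
table = {}
for (nt, rhs), terms in predict.items():
    for t in terms:
        if t is None:
            # nil rule: apply it on anything that can FOLLOW the non-terminal
            for term in follows[nt]:
                table[(nt, term)] = rhs
        else:
            table[(nt, t)] = rhs

print(table[("E", "int")])  # the rule E -> T E'
print(table[("E'", ")")])   # () - the nil rule E' ->
```

An LL(1) parser just looks up (top-of-stack non-terminal, next token) in this table to decide which rule to expand.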
Pck/Cfg handles all of this for us.
Pck's CFG rules are specified in the format outlined. Different parser systems may specify their rules in a different format but they all represent the same thing. The format PCK uses is fairly common.
Pck has the "PCK specification" file format which can contain lexer rules, CFG rules, or a combination of both, and attributes that mark up the symbols declared therein with extra information.
This is a simple line based format using the structure outlined in the above context. Lexer rules and attributes are beyond the scope of this article.
Here's the full PCK specification file for the simple expression grammar format we created above:
E:start
E -> E add T
E -> T
T -> T mul F
T -> F
F -> lparen E rparen
F -> int
add='\+'
mul='\*'
lparen='\('
rparen='\)'
int='[0-9]+'
The ":start" bit on the first line is an attribute. It applies to the non-terminal E, and is named "start". Since it has no "=" value portion, it acts as a simple flag.
All that does is tell the grammar which symbol is the start symbol. If not specified, the first non-terminal in the grammar is used. The attribute and the rules are the only parts of this document that are relevant to a CFG. Everything else is ignored here; Pck/FA uses the other bits (the lexer rules at the bottom) to generate tokenizers.
Normally, these files are machine generated. However, they are central to the functionality of PCK so they should be covered.
First of all, unless you're implementing your own context-free parsing algorithm, you'll not need to use most of this code directly. We'll first cover the common stuff you will use if you are doing things like creating parsers at runtime.
Usually, we want to get a PCK specification to begin with. We can read it from a file or parse one from a string using:
var cfg = CfgDocument.ReadFrom(filename);
or:
var cfg = CfgDocument.Parse(text);
respectively.
Alternatively, we can create them (or modify them) programmatically.
var cfg = new CfgDocument();
// add the first rule entry from our example above
cfg.Rules.Add(new CfgRule("E","T","E'"));
However we get one, the question is what do we do with it?
With the appropriate algorithm brought "online", we can create parsers with it, and this is really common. For example, referencing the pck.ll1 assembly allows for:
var parser = cfg.ToLL1Parser(tokenizer); // we got our tokenizer from elsewhere
We can write it to a string:
Console.WriteLine(cfg); // calls cfg.ToString() implicitly
We can clone it or compare it to another CFG for equality:
var cfg2 = cfg.Clone();
Console.WriteLine(cfg==cfg2); // prints True
We can add, modify, or remove rules:
cfg.Rules[0].Right[0]="F"; // modifies E -> T E' to be E -> F E'
We can enumerate or fill collections with the non-terminal and terminal symbols:
foreach(var nt in cfg.EnumNonTerminals()) Console.WriteLine(nt);
foreach(var t in cfg.EnumTerminals()) Console.WriteLine(t);
foreach(var s in cfg.EnumSymbols()) Console.WriteLine(s);
var ntl = cfg.FillNonTerminals();
var tl = cfg.FillTerminals();
var sl = cfg.FillSymbols();
We can compute the firsts, follows and predict sets:
var firsts = cfg.FillFirsts();
var follows=cfg.FillFollows();
var predict=cfg.FillPredict();
These are used by the various parser algorithms to compute their parse tables, although each uses them differently, or may not use all of them directly.
There's a lot more there, most of which you will probably never use, frankly, so we'll skip it. But keep this information in mind, as it will make using PCK a bit clearer and provide background for future articles.