Cfg: Jumpstart Your Parser Generator Projects

Updated on 2020-01-31

A small library for providing baseline Context-Free Grammar computations

Introduction

Most major parser generators operate off of some variant or another of the Context-Free Grammar theory of parsing, which is a rigorous mathematical model designed to describe Chomsky Type-2 languages. This library provides basic support for querying symbols and computing first/follows/predicts for a grammar for k=1.

Chomsky Type-2 languages

Basically, most parser generators need the same basic services: A way to keep track of rules and symbols, and a way to compute FIRSTS/FOLLOWS/PREDICT sets. That's what this little library does.

It's highly specialized, but if you're building a parser generator, you could do worse than starting with this library.

Conceptualizing this Mess

First of all, a note about collections in my APIs. I don't use "getter" properties for collections. I use "Fill" methods which are more flexible in that they can return a new collection filled with items or fill an existing collection. You'll see various FillXXXX() methods in my code, and this is what they are. Calling them like...

var syms = cfg.FillSymbols();

...is the equivalent of using a getter property.

CfgDocument is the main CFG object that basically represents a grammar, while CfgRule represents a single rule. All symbols are represented as simple strings. The CfgDocument.Rules property holds a collection of rules that make up the grammar. Once these are filled, you can perform various computations on them.

After that, coding it is pretty straightforward, and I've included an extensive demo project which includes a runtime parser powered by this lil beast.

Coding this Mess

I prefer to let the demo code do the talking.

static void Main(string[] args)
{
    if(0==args.Length)
    {
        Console.Error.WriteLine("Must specify input CFG");
        return;
    }
    var cfg = CfgDocument.ReadFrom(args[0]);

    // not-necessary but faster access since we're not modifying:

    cfg.RebuildCache();
    Console.WriteLine("See: http://hackingoff.com/compilers/ll-1-parser-generator");
    Console.WriteLine();
    Console.WriteLine("CFG has {0} rules composed of {1} non-terminals and
                       {2} terminals for a total of {3} symbols" ,cfg.Rules.Count,
                       cfg.FillNonTerminals().Count, cfg.FillTerminals().Count,
                       cfg.FillSymbols().Count);
    Console.WriteLine();

    Console.Write("Terminals:");
    foreach(var t in cfg.FillTerminals())
    {
        Console.Write(" ");
        Console.Write(t);
    }
    Console.WriteLine();
    Console.WriteLine();

    // compute the various aspects of the CFG
    var predict = cfg.FillPredict();
    // var firsts = cfg.FillFirsts(); // we don't need this because we have predict
    var follows = cfg.FillFollows();

    // enum some stuff
    foreach(var nt in cfg.FillNonTerminals())
    {
        Console.WriteLine(nt+" has the following rules:");
        foreach(var ntr in cfg.FillNonTerminalRules(nt))
        {
            Console.Write("\t");
            Console.WriteLine(ntr);
        }
        Console.WriteLine();
        Console.WriteLine( nt + " has the following PREDICT:");
        foreach (var t in predict[nt])
        {
            Console.Write("\t");
            Console.WriteLine((t.Symbol??"<empty>")+" - "+t.Rule);
        }
        Console.WriteLine();
        // PREDICT makes this redundant
        //Console.WriteLine(nt + " has the following FIRSTS:");
        //foreach (var t in firsts[nt])
        //{
        //    Console.Write("\t");
        //    Console.WriteLine(t);
        //}
        //Console.WriteLine();
        Console.WriteLine(nt + " has the following FOLLOWS:");
        foreach (var t in follows[nt])
        {
            Console.Write("\t");
            Console.WriteLine(t);
        }
        Console.WriteLine();
    }

    // now let's parse some stuff

    Console.WriteLine("Building simple parse table");

    // the parse table is simply nested dictionaries where each outer key is a non-terminal
    // and the inner key is each terminal, where they map to a single rule.
    // lookups during parse are basically rule=parseTable[<topOfStack>][<currentToken>]
    var parseTable = new Dictionary<string, Dictionary<string, CfgRule>>();
    foreach (var nt in cfg.FillNonTerminals())
    {
        var d = new Dictionary<string, CfgRule>();
        parseTable.Add(nt, d);
        foreach(var p in predict[nt])
        {
            if(null!=p.Symbol)
            {
                CfgRule or;
                if(d.TryGetValue(p.Symbol,out or))
                {
                    Console.Error.WriteLine("First-first conflict between " +
                                             p.Rule + " and " + or);
                } else
                    d.Add(p.Symbol, p.Rule);
            } else
            {
                foreach(var f in follows[nt])
                {
                    CfgRule or;
                    if (d.TryGetValue(f, out or))
                    {
                        Console.Error.WriteLine("First-follows conflict between " +
                                                 p.Rule + " and " + or);
                    }
                    else
                        d.Add(f, p.Rule);
                }
            }
        }
    }

    #region Build a Lexer for our parser - out of scope of the CFG project but necessary
    Console.WriteLine("Building simple lexer");
    var fas = new FA[]
    {
        FA.Literal("+","add"),
        FA.Literal("*","mul"),
        FA.Literal("(","lparen"),
        FA.Literal(")","rparen"),
        FA.Repeat(FA.Set("0123456789"), "int")
    };

    var lexer = new FA();
    for(var i = 0;i<fas.Length;i++)
        lexer.EpsilonTransitions.Add(fas[i]);
    Console.WriteLine();
    #endregion

    var text = "(1+3)*2";

    Console.WriteLine("Creating tokenizer");
    var tokenizer = new Tokenizer(lexer, text);
    Console.WriteLine("Creating parser");
    var parser = new Parser(parseTable, tokenizer, "Expr");
    Console.WriteLine();
    Console.WriteLine("Parsing " + text);
    Console.WriteLine(parser.ParseSubtree());
}

And there's your runthrough. Obviously if you're building a parser-generator, you'll be using the PREDICT/FOLLOWS sets to build that parse table in code, or even rendering recursive descent parsers with it like Parsley does.

Parsley

Other Resources

History

  • 31st January, 2020 - Initial submission