Building a Parrot Compiler
Pages: 1, 2, 3
Optimizing the Grammar
Recursive descent parsers do a left-most, depth-first match of the grammar against the input source. While this is a powerful technique, it's also potentially very slow — there's a lot more backtracking with a recursive descent parser than with a LALR parser. Because of this, it's a good idea to optimize your grammars, otherwise you'll find your parser too slow, regardless of the language it's written in, to be useful.
Optimizing recursive descent grammars deserves an article (or more) all by itself, but there are a few easy tricks you can use to get a respectable speedup in parse times. The easiest way to make a recursive descent parser faster is to make it do less work. Since rule failures (that is, rules that don't match) make up a significant amount of the work the parser does, make sure the parser fails as rarely as possible, or, rather, succeeds as often as possible.
For example, if you have a grammar with the following keywords:
abort: abort
else: else
if: if
print: print
read: read
then: then
It's useful to have a grammar element that encompasses them all. As a first try, you'd probably throw something together like:
keyword: abort | else | if | print | read | then
and that will work. It'll also be slower than it has to be.
Why? Well, that's due to the way that recursive descent parsers work. When
the parser is looking to see if the keyword rule matches, it tries each
sub-rule in turn. In this case the rules are listed alphabetically, so it tries
abort first, then else, then if, and so
on. Programming languages are no different than any other language. Some words
are more common than others. In this case, it's likely that abort is the
least common of the words in your program, so you're guaranteed that
the keyword rule will fail the first test most of the time. The rule will still
work, just not as fast as if you had the subrules ordered by frequency
of occurrence. Performing a simple frequency analysis on a representative
sample of what you're parsing will give you the information you need to order
the subrules more efficiently.
Straight frequency sorting works when each subrule costs about the same amount to evaluate. When your subrules are each constant strings they each take about the same amount of time to evaluate. If you have some subrules that are significantly more expensive — because you've got a complex regular expression, or have to execute code to check if a subrule matched — you may find it better to reorder the subrules based both on frequency and expense. Often this is a matter of trial and error, but proper ordering of subrules can cut parsing time by half or more, depending on how many subrules you have and the frequency distribution of your keywords.
Generating the Code
Once you've transformed your source program into a parsed tree, you need to then turn that tree into executable code. In our case, we generate PIR, which is Parrot's Intermediate Representation, a sort of high-level assembly language. While we could have generated the assembly directly, PIR has some features that are nice for compiler writers, including automatic register allocation, subroutine call handling, and subroutine preamble and postamble handling. If you've never had to deal with any of these things count yourself lucky. If you have, you'll be happy you won't have to do so now.
Our example language is pretty simple. We can declare variables, use them in expressions, store numbers and strings in them, and print them out. There are no loops, decisions, blocks, subroutines, or flow control words. While that's not sufficient for a real language, it's enough for example purposes.
Sample Program
Since we need a source program to go with our grammar, let's use the following program:
declare foo
declare bar
foo = 15
bar = (foo + 8) * 32 7
print bar
print \n
print foo % 3
print \n
Once again, nothing at all fancy. The output should be 729 and 5, each on separate lines. Each line is a single statement, so we don't have to worry about statements spanning lines, multiple statements on a line, or other unpleasant complications. This is a good thing, at least for example purposes.
Source CodeDownload the example file, compiler.pl. |
The basic parser and compiler is in Listing 1. It integrates the grammar we detailed earlier with the code necessary to handle the output of the parser and generate PIR code to feed into Parrot. When this program runs, it parses our source and, if there are no errors, it feeds the results into the compilation code, which produces a file of PIR code. From there you can feed the PIR to Parrot and see it run. This compiler fully parses the source before handing the results off for code generation, but you can easily intersperse parsing and code generation if you want.
Basic Parser Behavior
The way we have Parse::RecDescent set up, it generates a tree
for each statement. This is a series of arrays of arrays (of arrays of arrays,
and so forth) with each matching rule producing an array whose elements are the
rule name and the matching sub-rule elements. If what matched for a rule is
another rule the element will be an array reference, while if the rule matches
one or more strings and regexes, the elements will be string constants.
For example, consider the statement
declare foo
That matches the declare rule, which has two elements, the
string constant declare and a field expression. The field
expression itself is just a regex that matches a word, at least as Perl defines
it. That means the result is:
['declare', 'declare', ['field', 'foo'] ]
The inner rule for the field produces an array of two elements, the rule
name (field) and the string that matched the regex
(foo). The outer rule produces a result with three elements, the
rule name (declare), the string declare, and the
result of the field subrule.
The parser will, when it's done, hand the code generation routine a tree for each statement that it parses. Since each node and subnode is tagged with the rule that it matches, the easiest way to process this tree is with a table of functions for the rules. For our grammar, the table would look like:
my (%handlers) = (
addval => \&handle_generic_val,
assign => \&handle_assign,
cmpval => \&handle_generic_val,
constant => \&delegate,
declare => \&handle_declare,
expr => \&delegate,
field => \&handle_field,
float => \&handle_float,
logval => \&handle_generic_val,
modval => \&handle_generic_val,
mulval => \&handle_generic_val,
parenexpr => \&handle_paren_expr,
print => \&handle_print,
simplevalue => \&delegate,
statement => \&delegate,
stringconstant => \&handle_stringconstant,
);
Using a table like this makes the compiler easily extendable — if you add in a new node you just add in a new entry in the table. You'll also note that some of the rules share functions. This is fine, and we'll get into why in a little bit.
Emitting Header Information
Before we start processing the output of the parser, we need to put out some boilerplate header information. This is a common thing for compilers to do, and you'll find yourself with a small catalog of boilerplate code. In this case, the boilerplate looks like:
.sub __MAIN prototyped, @MAIN
.param pmc argv
This declares that we're in a subroutine named __MAIN, which is
prototyped, and takes a single PMC parameter argv. Being in a
subroutine is mandatory; Parrot requires that all code executed must be in a
subroutine, function, or method. For us, with our simple sub-less language,
the name is irrelevant, but in languages with real subs and methods the name is
often meaningful. The @MAIN tag on the subroutine marks it as the
starting subroutine for our program. When parrot executes our program, this is
where it starts.
At the moment, Parrot assumes that the first subroutine in the first file that's executed is the driving sub, though we may change that later on.
The parameter, argv, goes unused in our example language, but
it's an array PMC that holds the command-line parameters passed into
parrot when our program started.
Emitting PIR Code
After the boilerplate is emitted, we can start processing the trees of nodes the parser has generated. If we mandate that each node-handling function returns the PIR code needed to process that node, the code driving code generation can look like:
foreach my $node (@nodes) {
my (@lines) = process_node(@$node);
print join("", @lines);
}
In this case, we assume that the @nodes array holds all the
statement nodes from our parsed program. We flatten and process each node in
turn, printing out all the code for the node. The function to process a node
looks like:
sub process_node {
my (@elems) = @_;
if(exists($handlers{$elems[0]})) {
return $handlers{$elems[0]}->(@elems);
} else {
return "***", $elems[0], "***\n";
}
}
This function (the version in the listing has some extra error checking) just looks to see if it can find a function in our global handler hash to handle the node that's been passed in. If it can, it calls that function, passing in the flattened node data. If it can't, it emits an easy-to-find string that notes the node that it couldn't find a handler for. This comes in handy when debugging the compiler later on, as it makes it easy to find where you've found a token you don't know how to deal with.




