Most of my hacking time today was spent on getting our Parsec-based parser to not backtrack so much. But first, a little background discussion:
Currently, Perl 5 regular expressions almost always backtrack into previous parts, when the latter parts cannot match. For example, the match below would backtrack ten times, each time trying to match different length of a, before finally giving up.
"aaaaaaaaaab" =~ /^a+\d/
Because of this, complex regexes in Perl 5 often exhibit surprising (or downright pathological) time/space behaviours when applied to unanticipated source text.
In Perl 5, we can use the esoteric "never backtrack over this" device (?>...) to suppress backtracking:
"aaaaaaaaaab" =~ /^(?>a+)\d/
This will eat up all the /a+/, seeing the trailing "b" does not match \d, and immediately give up there without backtracking. TimToady noted that there are a dearth of scalable parsers written in Perl 5 regexes, which may be related to the primitive (and non-widespread) support of backtracking control.
Within Perl 6 rules, a single colon introduces a don't backtrack over me point, so the above regex can be rewritten into this far more elegant form:
"aaaaaaaaaab" ~~ /a+ : \d/
However, @Larry is currently pondering making non-backtracking the default for
rule{...} constructs (as opposed to rx{...}, which still backtracks by default).
Now, the Perl 6 grammar was designed to be parsable with very little backtracking. However, my early spike-prototyping demands, as well as previous brainwashing from over-exposure to Perl5 regexes (cf. Template::Extract /Template::Generate), resulted in my incessant abuse of backtracking in the Pugs parser, making it much slower than it needs to be -- especially on
parsefails.
Concretely, I abused the fast commit-by-default Parsec into a mostly backtrack-by-default parser, by sprinkling Pugs.Parser with try calls, which has the opposite effect of colons in Perl 6 Rules -- i.e. feel free to backtrack if this happens to not match.
Acting on TimToady's suggestion and luqui's observation of the obvious insanity of this approach, today I set forth to remove all try in Parsec's alternation rules (denoted by tryChoice). Unsurprisingly, everything broke down and Prelude.pm failed to parse. So I havn't committed it -- but there is a preliminary patch.
However, this exercise uncovered several hidden corners in Perl 6's grammar, where backtracking wasn't supposed to happen, but was nevertheless deemed to be neccessary by Parsec.
The most relevant one is the condition-part rule, as used by statement_control forms, such as if/unless/while/until. Because Perl 6 no longer requires parentheses around conditions, we can easily get an ambiguous parse:
if foo { 3 } { 4 }
this can mean two different things:
if (foo) {3}; { 4 };
if (foo {3}) { 4 };
Instead of continuing the current obviously-costly way of relying on backtracking, TimToady and I devised a clever solution, by specifying that bare blocks cease to be valid argument terms in the condition expression, unless protected by some kind of parens.
Because of this, whitespace-followed-by-opening-brace at term position always terminates the parsing of that part, and the "if (foo) {3};" interpretation always wins. To get the other interpretation, write it as such with explicit parentheses.
I'm pretty happy that the tuned Parsec seems to be significantly faster at parsing sanity tests. (As an aside, I also committed the same batch of sanity tests to the PGE-Perl6 tree -- the other 10k tests needs to wait until it can handle Test.pm.)
If I get some more hacking cycles tomorrow, I'll check it in and see if it can make another dent on the multiple-hours make smoke process.
But till then... Good night, and see you tomorrow. :-)
Audrey, do you mean a* instead of a+ in the first examples?
Posted by: dave glasser | 2006.04.16 at 01:10 AM
glasser: Good catch, I did start with a* but then changed my mind to a+, resulting in conflicting text vs. code... I've updated the text part. Thanks!
Posted by: Audrey T | 2006.04.16 at 12:14 PM