2007.10.24

A graphical tracer for Perl 6 regexes based on PCR

Tracing parse failures by hand while developing a compiler can be really a daunting task, because the real problem can appear everywhere -- the grammar definition, the input string being matched, or even the regex engine itself.

So I've implemented a graphical tracer for Perl 6 regexes atop PCR (Pugs::Compiler::Rule). You can find some small online demos here:

To generate the HTML pages (say, the first demo) yourself, simply check out the Pugs repository , ``cd'' into perl5/Pugs-Compiler-Rule, and enter the following commands:

  $ perl util/compile_p6grammar.pl -D \
      examples/digits.grammar > Digits.pm
  $ echo '7c3d54' > digits.input
  $ perl -Ilib -MDigits -e \
     'undef $/; print Digits->count(<>)->(), "\n"' \
     digits.input > trace.out
  $ perl util/gen-tracer-view.pl --outdir tmp/digits \
      examples/digits.grammar digits.input < trace.out
  $ firefox tmp/digits/index.html

As of this writing, the tracer interface still needs love and I'm adding cooler features like ``random jump'', ``stepping in a specified pace'', and ``stepping back''. If you like to help, just join #perl6 and get a Pugs commit bit ;)

I think this tool should be very useful for both regex engine developers and compiler writers, especially when parsing fails in an unexpected way. And it can also be beneficial to Perl 6 beginners who want to learn the shiny new regex syntax and complicated matching behaviors by just ``stepping through'' the actual parsing process. moritz++ said on #perl6 that he would build a CGI script to make my demos above ``alive'' when he had the tuits; let's just wait and see ;)

It should be warned that the regex syntax supported by the current PCR implementation is already a little out-of-date regarding the lastest Synopsis 5 . (Thanks TimToady++ for tweaking the regex syntax during the meantime ;)) I'll try to port KindaPerl6 's perl5rx backend over to PCR later.

Hey, it'll be nicer to have KindaPerl6, PGE , Parse::RecDescent, or even the Perl 5 regex engine to work with my tracer as well! :)

Stay tuned!

-agentzh

2007.04.16

YAPC-SA-2007 Hackathon

Here is the report on the 4 days of the YAPC-SA-2007 Hackathon, from April 12 to 15, 2007, in Porto Alegre, Brazil (during fisl8.0).

v6/docs
- added the KindaPerl6 diagram
- added a KindaPerl6 memory layout diagram - with help from #perl6
- added the MiniPerl6 talk (pdf)

MiniPerl6
- started the Perl6-in-Parrot target: desugars MiniPerl6 into p6parrot supported features
- grammar fix: semicolon is optional after a block
- perl5 backend: implemented warn()
- new: mp6-to-AST script
- nferraz started porting Test.pm to mp6

KindaPerl6
- memory management: BEGIN-block side effects are logged
- updated plan, discussed a possible release schedule
- fixed the container proxy algorithm (the implementation needs to be fixed; pugs has the same problem)

v6.pm
- fixed the "fish-operator" =<>
- fixed the quoted-word-operator <...>

Pugs::Compiler::Rule
- ratchet emitter fix, array interpolation now works

misc/pX
- nferraz and fglock started an adventure game implementation (with v6.pm), to be presented by nferraz at the Nordic Perl Workshop.

The network proxy at the conference did not allow commits to pass through, and there will be some delay until all the changes get to the repository.

Thanks to cmarcelo, david_fetter, eden_c, lorn, merlyn, and nferraz for 4 great days!

- Flavio S. Glock (fglock)

2006.09.17

PCR replaces PGE in Pugs

Tonight, Audrey implemented the bridge between Pugs' Haskell core and the Perl 5 module Pugs::Compiler::Rule, thus bringing rules support to our pugs ``for free''. This is really good news to us. :)

For more than one year, Pugs required parrot to provide full Perl 6 regexes (rules) support since Pugs didn't have its own grammar engine. In those early days, parrot's PGE (Parrot Grammar Engine) was the only choice.

However, obtaining and compiling parrot is a daunting task for most inexperienced users. And furthermore, the Pugs team has been struggling with the interoperability between these two ``animals'' to ensure they play together well. Everyone must still remember that parrot embedding was almost like a nightmare.

Hence it was hoped that pugs could get rid of the parrot dependency completely. With fglock's excellent work started from the beginning of this year, it is a reality today. Pugs::Compiler::Rule (PCR) is a pure Perl 5 implementation for Perl 6 regexes. Although there's still a lot of missing features in PCR (see its TODO file for details), we have already got a simple Perl 6 compiler based on PCR, which has passed ~1000 tests in the Pugs test suite. So it's good enough. And you can try out Perl 6 regexes without parrot anymore:

    pugs> 'abc' ~~ /\w+/
    Match.new(
      ok => Bool::True,
      from => 0,
      to => 3,
      str => "abc",
      sub_pos => (),
      sub_named => {}
    )

Audrey says it reuses the now-on-by-default perl 5 embedding feature of pugs. It also works for Windows users since I've helped her to solve the long-overdue p5 embedding problem on ActivePerl 5.8.x. And gone is the long-standing message ``perl5 embedding is not available on Win32'' while building Pugs. Yay!

Note that the PCR integration work is still under way, but we believe it's a good start anyway. :D

Update:

It is still undecided if Pugs should keep PGE as an alternative. I think leaving the choice to the user is more sensible. :)

2006.06.22

use v6-pugs;

now invokes the new Perl5-based Perl6 compiler:

use v6-pugs;
"hello, world".say;

$ perl -Ilib hello_world.pl

It can even be called from the command line:

$ perl -Ilib lib/v6.pm -e ' "hello, world".say; '

It still fails many of the sanity tests, as well as nearly all of the other tests under t/.

2006.04.30

More whitespace sensitivity.

Today I put in the finishing touches to predictive parser, with support for "$x .= foo" and "foo.bar". In both cases, whitespace is now very important:

$x .= foo     # $x .= $x.foo
$x .=(foo()) # same thing
$x .=(foo)   # same thing
$x.=(foo)    # $x.postfix:<=>(foo())

There is no built-in Perl 6 classes with the postfix:<=> method, so the last one is probably an error.  Another example:

foo.bar;   # foo().bar()
foo.bar(); # same thing
foo .bar;  # foo($_.bar())

That is surely the correct behaviour, but my brain (too used to backtracking dwimmy parsing) still need some time to adjust from the change.  The test files, too, needs some adjusting: Of the 630 test files, 76 are still failing parsing -- usually, it's a missing whitespace here or there, but there may still be some ambiguities hidden in the corner.  Helps welcome to triage those tests!  If you'd like to help, look at this smoke test report, grep for " 0.00%" (with the space), run "./pugs -Iblib6/lib t/path/to/test.t", and try to diagnose the error.

In other news, aufrank++ started to re-read S06: Subroutines, in order to start working on a Rules-based parser for the full signature syntax, complete with the not-yet-supported features such as array/hash/structural unpacking.  Once that grammar is in place, we can use scw++'s work to use it as part of Pugs's Parsec parser, and also pack it up into a Module::Compile-based bridge to Data::Bind, to provide a modern successor to the highly impressive Perl6::Bindings for Perl 5.  Yay!

Less parsefails and more parseideas.

During the waiting-time in today's porcelain painting session, I finally figured out the way to implement the "any line ending in a semicolon (except for whitespace) can terminate a statement" rule:

my &foo := {
    say 1;
    say 2;
}   # Look, no semicolon!
my &bar := {
    say 3;
    say 4;
}   # ditto

See S04: Statement-ending Blocks for details.  The solution turns out to rely on a Parsec feature: setInput, that tells the rest of the parsing to operate on a different string, instead of the current one.  With this, for statement-level parsing, we can parse the block (up to the closing '}'), mark the position, parse some whitespace, and see if the line number has changed.  If yes, then it's a statement-ending block, and we insert a semicolon right there.  However, I wonder how to port this trick into Perl 6 rules...

Also I implemented .[0] and .<name> (shorthand for $_[0] and $_<name>), as well as the nice "@foo .= map(&sqrt)" mutating-method idiom.  During the discussion in #perl6 that followed, I then stumbled upon a possible solution to the long dot not short enough problem -- thanks Juerd++ for the excellent summary on the perl6-language list!

2006.04.27

Parsec is Rules engine.

Today I started the process of making Parsec a valid Rules engine, due the the whitespace boundary (<ws>) handling required to distinguish these two forms:

if %ENV{'PATH'} { 'foo' }  # one "if" statement
if %ENV {'PATH'} { 'foo' } # two statements: "if" and bare-block 'foo'

This is because whitespace is never allowed before postfix operators in Perl 6, and we rely on this property to avoid infinite lookaheads for cases like "$x < 3" versus "$x<3>".

I went for the most convenient approach, namely keeping a UserState of the character class of the previous Char -- \s, \w, or others.  It works as expected, but makes parsing 30% slower again, completely offsetting the gain of conversion to predictive parsing.  I'll run some profiles and see if I can optimize it back to speed. I also added a UserState for offset, for .from/.to support. (Previously it only maintains lines and column numbers.)

The next step would be add UserState of the current named/positional captures, and then change the result type to from "a" to "Match a", then voila -- we'll have a Rules engine.  scw (recently resurfaced on #perl6) offered to write a MiniPerl6-to-Haskell compiler, so we can have a shared Perl 6 grammar and the AST production rules for Perl 6, and translate them into Perl5/Parrot/Haskell as needed.

I also pointed spinclad to the optable.c skeleton, in the hope that the three implementations -- each have an operator precendence parser, none of which particularly fast -- can share a stable, C-based implementation.  My C-fu is weak, so I'm glad to see interest on that front. :-)

In other news, particle has been populating the Parrot/Perl6 tree with symbolic and named builtin prototypes, imported from Pugs's primitives table, so unary and nullary functions can start to parse properly. Yay!

2006.04.19

Some parsing bits.

Short update today, as tomorrow $job resumes and I need to catch enough sleep -- unfortunately having to miss the midnight-#parrotsketch session yet again.

This afternoon I checked in a 10-minutes hack to src/rules/, the beginning of librules, a C-based implementation for the operator precedence parser (the token half of the Perl 6 parsing cycle), in an attempt to enhance the PIR-based PGE::Match and PGE::OPTable's performance, and provide a common layer for PIR/P5/Hs-based Rules implementations to call into.

The current design is heavily influenced by libsyck, and currently I use Judy arrays for sparse arrays and hashes, because the C API is so extremely sane. I also plan to use them for the Haskell runcore, to replace the atrociously slow IntMap-based array implementation. The other missing piece of this new-runtime puzzle is string-buffer-fragments, but that's for another day...

In other news, I checked in an updated version of cognominal's Siva PMC as pugscapture.pmc, a PMC that implements the invocant/sequence/mapping structure within a Capture object.  As Match objects are conceptually just a subclass of it, it may also prove to be useful for that purpose as well. Plus XML tag/children/attributes, as well as a lot of other things.

Anyway, that's it for now.  See you tomorrow. :-)

2006.04.16

Backing out from the backtracking track.

Most of my hacking time today was spent on getting our Parsec-based parser to not backtrack so much.  But first, a little background discussion:

Currently, Perl 5 regular expressions almost always backtrack into previous parts, when the latter parts cannot match.  For example, the match below would backtrack ten times, each time trying to match different length of a, before finally giving up.

    "aaaaaaaaaab" =~ /^a+\d/

Because of this, complex regexes in Perl 5 often exhibit surprising (or downright pathological) time/space behaviours when applied to unanticipated source text.

In Perl 5, we can use the esoteric "never backtrack over this" device (?>...) to suppress backtracking:

    "aaaaaaaaaab" =~ /^(?>a+)\d/

This will eat up all the /a+/, seeing the trailing "b" does not match \d, and immediately give up there without backtracking.  TimToady noted that there are a dearth of scalable parsers written in Perl 5 regexes, which may be related to the primitive (and non-widespread) support of backtracking control.

Within Perl 6 rules, a single colon introduces a don't backtrack over me point, so the above regex can be rewritten into this far more elegant form:

    "aaaaaaaaaab" ~~ /a+ : \d/

However, @Larry is currently pondering making non-backtracking the default for
rule{...} constructs (as opposed to rx{...}, which still backtracks by default).

Now, the Perl 6 grammar was designed to be parsable with very little backtracking.  However, my early spike-prototyping demands, as well as previous brainwashing from over-exposure to Perl5 regexes (cf. Template::Extract /Template::Generate), resulted in my incessant abuse of backtracking in the Pugs parser, making it much slower than it needs to be -- especially on
parsefails.

Concretely, I abused the fast commit-by-default Parsec into a mostly backtrack-by-default parser, by sprinkling Pugs.Parser with try calls, which has the opposite effect of colons in Perl 6 Rules -- i.e. feel free to backtrack if this happens to not match.

Acting on TimToady's suggestion and luqui's observation of the obvious insanity of this approach, today I set forth to remove all try in Parsec's alternation rules (denoted by tryChoice).  Unsurprisingly, everything broke down and Prelude.pm failed to parse.  So I havn't committed it -- but there is a preliminary patch.

However, this exercise uncovered several hidden corners in Perl 6's grammar, where backtracking wasn't supposed to happen, but was nevertheless deemed to be neccessary by Parsec.

The most relevant one is the condition-part rule, as used by statement_control forms, such as if/unless/while/until.  Because Perl 6 no longer requires parentheses around conditions, we can easily get an ambiguous parse:

    if foo { 3 } { 4 }

this can mean two different things:

    if (foo) {3}; { 4 };
    if (foo {3}) { 4 };

Instead of continuing the current obviously-costly way of relying on backtracking, TimToady and I devised a clever solution, by specifying that bare blocks cease to be valid argument terms in the condition expression, unless protected by some kind of parens.

Because of this, whitespace-followed-by-opening-brace at term position always terminates the parsing of that part, and the "if (foo) {3};" interpretation always wins.  To get the other interpretation, write it as such with explicit parentheses.

I'm pretty happy that the tuned Parsec seems to be significantly faster at parsing sanity tests.  (As an aside, I also committed the same batch of sanity tests to the PGE-Perl6 tree -- the other 10k tests needs to wait until it can handle Test.pm.)

If I get some more hacking cycles tomorrow, I'll check it in and see if it can make another dent on the multiple-hours make smoke process.

But till then... Good night, and see you tomorrow. :-)

2006.01.02

Grammar support.

Today I added two more rules in the default grammar, in addition to <p6rule>:

rule p6namedrule
    { rule \s+ ([\w|<'::'>]+) \s* \{ <p6rule> \} ;? \s* }
rule p6grammar
    { ^ grammar \s+ ([\w|<'::'>]+); \s* <p6namedrule>* $ }

With this, the PILN grammar is now successfully parsed with its Syntax.grammar file.  To support that, tewk and I have implement various contructs:

  • Enumerated character classes: <[abc]>
  • Complemented character classes: <-[abc]>
  • Literals: <'abc'>
  • Begin/End anchors: ^ $ ^^ $$
  • Partial parse of <p6rule> so it stops at the closing brace when used as part of <p6grammar>.

In addition to that, I've written a simple tree trasforming function for <p6grammar>, which takes a Match object and turns it into compiled rule expressions.  Instead of using TGE-style separate set of attribute grammars as decribed in the Parrot compiler tools plan, I'm associating each rules with consumers, using e.g. SYB Generics to handle post-subrule-parsing production.

However, that scheme means we needed to perform multiple redundant traversals.  I'd much rather something like this, with a conjectural :{} rule trait:

rule p6namedrule
  :{ ($0 => $<p6rule>) }
   { rule \s+ ([\w|<'::'>]+) \s* \{ <p6rule> \} ;? \s* }
rule p6grammar
  :{ Grammar.new(name => $0, rules => hash(@<p6namedrule>) }
   { ^ grammar \s+ ([\w|<'::'>]+); \s* <p6namedrule>* $ }

Some discussion on #perl6 ensued; luqui and I bounced some ideas about embedding attribute grammars into the rule declarations, and promptly went into tangents such as the evilness of keeping $! as global -- $!, $/ and $_ really should all be environmental -- expect some p6l threads about them in the near future...