Friday, March 23, 2012

PyCon: Parsing Horrible Things with Python

See the website.

He's trying to parse MediaWiki text. MediaWiki is based on lots of regex replacements. It doesn't have a proper parser.
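To give a flavor of the regex-replacement approach, here's a toy sketch of my own (not MediaWiki's actual code):

    import re

    # Toy sketch of regex-replacement "parsing" (not MediaWiki's real code):
    # turn '''bold''' into <b>bold</b> in a single pass.
    def render_bold(text):
        return re.sub(r"'''(.*?)'''", r"<b>\1</b>", text)

    print(render_bold("This is '''important''' stuff."))
    # -> This is <b>important</b> stuff.

It works until the markup nests or is unbalanced, which is exactly why a pile of these replacements is so hard to reason about.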

He's doing this for the Mozilla wiki.

He tried Pyparsing. (Looking at it, I think I like PLY better, syntactically at least.) He had problems with debugging. Pyparsing is a recursive descent parser.
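Here's a minimal sketch of what Pyparsing looks like; the wiki-bold rule is my own illustrative example, not from the talk:

    from pyparsing import QuotedString

    # Hypothetical example: match MediaWiki-style bold, i.e. '''text'''.
    bold = QuotedString("'''")

    print(bold.parseString("'''hello world'''"))
    # -> ['hello world']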

He tried PLY. He really likes it. It's an LALR(1) parser generator. PLY has stood the test of time, and it has great debugging output.
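A minimal PLY sketch (my own example, not from the talk) — a left-recursive addition grammar, which LALR(1) handles naturally:

    import ply.lex as lex
    import ply.yacc as yacc

    # --- Lexer ---
    tokens = ('NUMBER', 'PLUS')
    t_PLUS = r'\+'
    t_ignore = ' '

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    # --- Parser: left recursion is fine in LALR(1) ---
    def p_expr_plus(p):
        'expr : expr PLUS NUMBER'
        p[0] = p[1] + p[3]

    def p_expr_number(p):
        'expr : NUMBER'
        p[0] = p[1]

    def p_error(p):
        print("Syntax error at", p)

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse("1 + 2 + 3"))  # -> 6

The parser.out file that yacc.yacc() writes is the debugging output he praised: it shows the full state machine and any grammar conflicts.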

However, it turns out that MediaWiki's syntax is a bit too sloppy for PLY.

LALR(1) just doesn't work for MediaWiki: the markup is too ambiguous and context-dependent. A run of five apostrophes ('''''), for instance, opens bold and italics at once, and what it means depends on what comes later in the line.

Next, he tried Pijnu. It supports PEGs, parsing expression grammars. He got it to parse MediaWiki. However, it has no tests, its code isn't Pythonic, it's way too slow, and it eats up a ton of RAM!

He wrote his own parser called Parsimonious. His goals were to make it fast, short, frugal with RAM, minimalistic, understandable, idiomatic, well tested, and readable. He wanted to separate parsing from doing something with the parse tree; hence, Parsimonious's only job is to produce an AST. The code is very readable.

If you optimize a bit of code, write a test to test that the optimization technique keeps working.
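For example (my own sketch, not his code): if the optimization is memoizing rule results, a test can pin that behavior down so a refactor can't silently undo it:

    from functools import lru_cache

    # Hypothetical optimization: memoize an expensive rule application.
    @lru_cache(maxsize=None)
    def apply_rule(pos):
        return pos * 2  # stand-in for real parsing work

    def test_memoization_still_works():
        apply_rule.cache_clear()
        apply_rule(5)
        apply_rule(5)
        # The second call must be served from the cache.
        assert apply_rule.cache_info().hits == 1

    test_memoization_still_works()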

In Parsimonious, you put all the parsing rules in one triple quoted string.

It has nice syntax.
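Here's roughly what that looks like, a sketch based on Parsimonious's own README example (the ((...)) bold markers are its toy notation):

    from parsimonious.grammar import Grammar

    # All the rules live in one triple-quoted string.
    grammar = Grammar(r"""
        bold_text  = bold_open text bold_close
        text       = ~"[A-Z 0-9]*"i
        bold_open  = "(("
        bold_close = "))"
    """)

    tree = grammar.parse("((bold stuff))")
    print(tree)  # prints the parse tree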

See github.com/erikrose/mediawiki-parser/blob/master/parsers.rst for his notes on a ton of different parsers.

Proper error handling is a master's thesis in itself.

There are Python bindings for ANTLR. However, he didn't want a separate code-generation step, and it's slow.

You can do interesting things with sys._getframe and dis.dis.
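A quick illustration of both (my own example):

    import dis
    import sys

    def peek():
        # Look one frame up the call stack at the caller's locals.
        caller = sys._getframe(1)
        print(caller.f_locals)

    def example():
        x = 42
        peek()  # -> {'x': 42}

    example()
    dis.dis(example)  # disassemble example()'s bytecode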

2 comments:

Daniele said...

The correct link to Erik's notes is https://github.com/erikrose/mediawiki-parser/blob/master/parsers.rst

Shannon -jj Behrens said...

Thanks!