The best kittens, technology, and video games blog in the world.

Saturday, August 05, 2006

XML stream processing with magic/xml

Photo from flickr, by katia., CC-BY.
There are basically two ways to process XML files:
  • Read them into memory as trees, and have all the cool methods. This works only with rather small XML files, especially since XML data takes more space in memory than on disk.
  • Process them as streams of events. This is usually very inconvenient (both styles are sketched just below).
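For concreteness, here's roughly what the two styles look like with Ruby's bundled REXML. This is just a minimal sketch; the file name dump.xml and the TitleListener class are made up, but the REXML calls are real:

require 'rexml/document'
require 'rexml/streamlistener'

# Tree style: convenient XPath queries, but the whole document must fit in memory
doc = REXML::Document.new(File.new("dump.xml"))
doc.elements.each("//page/title") {|e| puts e.text}

# Stream style: constant memory, but the logic gets turned inside-out
class TitleListener
  include REXML::StreamListener
  def tag_start(name, attrs)
    @in_title = (name == "title")
  end
  def text(s)
    puts s if @in_title
  end
  def tag_end(name)
    @in_title = false if name == "title"
  end
end
REXML::Document.parse_stream(File.new("dump.xml"), TitleListener.new)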
Oh yeah, and there's the ever popular third way: "forget they're XML and just use regular expressions to extract the data", for when we have a big XML file. This is usually more convenient than processing stream events ;-)

An obvious solution to all these problems would be processing the document as a stream of fragment trees. For example a Wikipedia dump consists of a header, and then a long stream of page entries. Even when the whole dump is huge, each page entry is tiny and easily fits in memory. Perl's XML::Twig did something like that, but it wasn't that simple to use.

Here comes magic/xml, with stream processing support that's easier than ever. Simply call complete! on a node and it magically gets all its children read. If you don't, you will keep getting its children in a nice stream. Here's a script that extracts article IDs and titles from a Wikipedia dump:
require 'magic/xml'

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page   # we only care about <page> entries
  node.complete!                   # read in all children of this node
  t = node.children(:title)[0].contents
  i = node.children(:id)[0].contents
  print "#{i}: #{t}\n"
}
And it extracts all the relevant data from XML which looks basically like this:

<mediawiki>
  <siteinfo>...</siteinfo>
  <page>
    <title>Astronomia</title>
    <id>1</id>
    ...
  </page>
  ...
  <page>
    ...
  </page>
</mediawiki>

It took 19 minutes to process a 405MB dump (pl.wikipedia.org) on my Athlon 1400, but it only took 3MB of memory! And if we replace the REXML parser backend with something faster, it should get a real speed boost. And when we get YARV. Oh well ;-)
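Since the script reads from STDIN, running it against a real dump is just a matter of piping. A sketch, with hypothetical file and script names:

bzcat plwiki-pages-articles.xml.bz2 | ruby extract_titles.rb
# prints one "id: title" line per page, e.g. "1: Astronomia"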

5 comments:

Anonymous said...

That looks very nice. But I would really like it if it used something other than REXML. Maybe Ruby libxml?

taw said...

It should be simple to switch to a different XML library. But there are so many of them, are there any benchmarks for them somewhere, maybe?

grcm said...

I'm also trying to use magic/xml for parsing Wikipedia dumps. It's great for small ones, but I think the REXML parser makes it very slow.
I believe libxml is the best one to use (it's libxml2 under the hood)...
It would be great if there were a wrapper for REXML that used libxml when available!

grcm said...

I ended up using sed/grep to parse the Wikipedia dumps; REXML was just too slow. But I love the way magic/xml makes it easy to control.

I've found another thing: "requiring/including" magic/xml in a ruby/amazon script breaks ruby/amazon somewhere in REXML. Not sure why, I'm afraid.

taw said...

Giles: The breaking is most likely due to patching Regexp#===. For technical reasons, Regexp#=== cannot be implemented in plain Ruby: it modifies $~ (and with it $1, $2, $3), which are local to the calling method's scope. So magic/xml used a hack to make it support matching against XML nodes. The hack was based on Binding.of_caller and worked fine in 1.8.4, but broke in 1.8.5.
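A tiny illustration of the scoping problem, with a made-up method name:

class Regexp
  # Plain-Ruby stand-in for Regexp#=== (hypothetical)
  def my_case_eq(str)
    # match sets $~ in *this* method's scope, not the caller's
    match(str) ? true : false
  end
end

/(\w+)/.my_case_eq("hello")
p $1   # => nil, the match data never reached this scope
/(\w+)/ === "hello"
p $1   # => "hello", the C implementation sets $~ in the caller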

Current packages have that hack turned off (the affected functionality was really minor), so they should work fine.