The best kittens, technology, and video games blog in the world.

Thursday, August 10, 2006

magic/xml beats XQuery at W3C's XML Query Use Cases

Photo by Maria S. from CuteOverload! (fair use)

It is now official - magic/xml is the best XML library ever ;-). magic/xml solutions to the W3C XML Query Use Cases problem set are on average a whole 1% smaller than the XQuery solutions, and it would probably do far better than that on more neutral benchmarks.

The Use Cases were really cool. Having a set of "real" (well, real enough) third-party problems made it possible to make magic/xml even more expressive. If I had relied only on my own problems, they would probably have reflected my biases about the "right" way to process XML. Full sources can be viewed at magic/xml's website, including a pretty-printed comparison of all solutions (pretty printing by CodeRay). Here's just a short summary of the cool new features.

XML.load(source) - stolen from CDuce's load_xml. source can be a file name, a URL or an open file handle. Oh yeah, and it supports "real" HTTPS now.

Pseudoattributes - XML has real attributes, but some people insist on using dummy elements instead. So you see a lot of <foo><bar>Hello</bar></foo>. magic/xml lets you pretend these are "real" attributes. You can read and even write to them and it will do the right thing: node[:@bar] += ", world!".
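
To make the idea concrete, here is a minimal sketch of how a pseudoattribute accessor could work. It is not magic/xml's actual implementation: ToyNode and its helper methods are made up for illustration, and only the node[:@bar] syntax comes from the text above.

```ruby
# Toy element class for illustration only; this is NOT magic/xml's real code.
class ToyNode
  attr_reader :name, :attrs, :children

  def initialize(name, attrs = {}, *children)
    @name = name
    @attrs = attrs
    @children = children
  end

  # node[:bar] reads a real attribute; node[:@bar] reads the text of the
  # first <bar> child element (nil if there is no such child).
  def [](key)
    tag = pseudo_tag(key)
    return @attrs[key] unless tag
    child = find_child(tag)
    child && child.children.join
  end

  # node[:@bar] = "x" replaces the text of <bar>, creating <bar> if missing.
  def []=(key, value)
    tag = pseudo_tag(key)
    if tag.nil?
      @attrs[key] = value
    elsif (child = find_child(tag))
      child.children.replace([value])
    else
      @children << ToyNode.new(tag, {}, value)
    end
  end

  private

  # :@bar -> :bar, anything else -> nil
  def pseudo_tag(key)
    s = key.to_s
    s.start_with?("@") ? s[1..-1].to_sym : nil
  end

  def find_child(tag)
    @children.find { |c| c.is_a?(ToyNode) && c.name == tag }
  end
end

# <foo><bar>Hello</bar></foo>
foo = ToyNode.new(:foo, {}, ToyNode.new(:bar, {}, "Hello"))
foo[:@bar] += ", world!"
puts foo[:@bar]
```

Because += expands to a read followed by a write, supporting it only takes the two accessors.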

Multielement children/descendant paths - node.children(:p, :*, :ul, :li, :*, :a) can find links inside elements of unordered lists inside paragraphs. Basically like XPath's /p//ul/li//a. Now I could have implemented XPath, but we're already beating XQuery anyway, and I have a vague feeling that by letting arbitrary objects be path elements we can do some really cool things using === and =~.
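
A rough sketch of how such a path could be evaluated, assuming :* means "any number of intermediate levels" (which is what the XPath comparison suggests). ToyElem, step_matches? and find_path are hypothetical names, not part of magic/xml.

```ruby
# Toy tree node for illustration; NOT magic/xml's real class.
ToyElem = Struct.new(:name, :children)

# A Symbol matches by tag name; any other object is asked through ===,
# so lambdas (Proc#===) and similar objects can serve as path elements.
def step_matches?(pattern, elem)
  pattern.is_a?(Symbol) ? elem.name == pattern : pattern === elem
end

# Walk down from node, consuming one path element per level.
# :* is assumed to skip zero or more levels.
def find_path(node, *path)
  return [node] if path.empty?
  head, *rest = path
  kids = node.children.grep(ToyElem)
  if head == :*
    found = find_path(node, *rest) + kids.flat_map { |k| find_path(k, *path) }
    found.uniq { |n| n.object_id } # a node can be reachable in more than one way
  else
    kids.select { |k| step_matches?(head, k) }
        .flat_map { |k| find_path(k, *rest) }
  end
end

doc = ToyElem.new(:body, [
  ToyElem.new(:p, [
    ToyElem.new(:div, [
      ToyElem.new(:ul, [
        ToyElem.new(:li, [ToyElem.new(:b, [ToyElem.new(:a, ["first"])])]),
        ToyElem.new(:li, [ToyElem.new(:a, ["second"])])
      ])
    ])
  ]),
  ToyElem.new(:ul, [ToyElem.new(:li, [ToyElem.new(:a, ["outside p"])])])
])

links = find_path(doc, :p, :*, :ul, :li, :*, :a)
puts links.map { |a| a.children.join }.inspect

# Any object responding to === works as a path element, e.g. a lambda:
list_tag = ->(e) { [:ul, :ol].include?(e.name) }
same_links = find_path(doc, :p, :*, list_tag, :li, :*, :a)
```

The link under the top-level ul is skipped because the path starts with :p.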

node =~ /regular expression/ - just a small cool thing: node =~ :foo matches if node has tag foo, node =~ /regular expression/ matches if the text inside it (with markup stripped) matches, and so on. It's not that cool on its own, but combined with multielement paths it could be something extremely powerful.
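
The matching rule described above can be sketched by defining =~ on a toy node class. ToyTag is a hypothetical stand-in, not magic/xml's class, and it covers only the Symbol and Regexp cases mentioned in the text.

```ruby
# Toy node class for illustration; NOT magic/xml's real class.
class ToyTag
  attr_reader :name, :children

  def initialize(name, *children)
    @name = name
    @children = children
  end

  # Text content with markup stripped: all text pieces in document order.
  def text
    @children.map { |c| c.is_a?(ToyTag) ? c.text : c.to_s }.join
  end

  # node =~ :foo     -> true if the tag name is foo
  # node =~ /regexp/ -> true if the stripped text matches
  def =~(pattern)
    case pattern
    when Symbol then @name == pattern
    when Regexp then pattern.match?(text)
    else false
    end
  end
end

# <p>Hello, <b>wor</b>ld!</p>
para = ToyTag.new(:p, "Hello, ", ToyTag.new(:b, "wor"), "ld!")

puts para =~ :p
puts para =~ /world/
puts para =~ :div
```

Note that /world/ matches even though the word is split by a b element, because matching runs on the stripped text.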

Tree fragments - ever wanted to extract "part of BODY between the second H1 and the first TABLE"? node.range(start, end) and node.subsequence(start, end) can do the right thing for you (either including parents or not). Programs become much easier to understand than ones that iterate over individual nodes and check whether each is in the right range or not.
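
As a simplified sketch of the idea, here is the "between the second H1 and the first TABLE" example for the easy case where both markers are direct children of the same parent. Elem and between are hypothetical names; magic/xml's range and subsequence also handle markers at different depths, which this sketch does not attempt.

```ruby
# Toy tree node for illustration; NOT magic/xml's real class.
Elem = Struct.new(:name, :children)

# Children of parent strictly between two marker nodes (compared by identity).
# Returns [] if either marker is missing or they are in the wrong order.
def between(parent, first_marker, last_marker)
  kids = parent.children
  from = kids.index { |k| k.equal?(first_marker) }
  to   = kids.index { |k| k.equal?(last_marker) }
  return [] if from.nil? || to.nil? || to <= from
  kids[(from + 1)...to]
end

body = Elem.new(:body, [
  Elem.new(:h1, ["Intro"]),
  Elem.new(:p, ["skipped"]),
  Elem.new(:h1, ["Details"]),
  Elem.new(:p, ["wanted"]),
  Elem.new(:ul, []),
  Elem.new(:table, []),
  Elem.new(:p, ["also skipped"])
])

second_h1   = body.children.select { |k| k.name == :h1 }[1]
first_table = body.children.find { |k| k.name == :table }

fragment = between(body, second_h1, first_table)
puts fragment.map(&:name).inspect
```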

So basically you now have 57 realistic examples of using magic/xml for XML processing. Enjoy :-)
