Friday, August 04, 2006

XML is huge and ugly

Photo from Commons by Over Fresh, public domain.
It's high time someone said that. There is a nice and elegant subset of XML that everybody uses. It looks more or less like this:
<foo color="blue"><p>Some text &amp; and a bit more</p><br/></foo>
So we have tags, attributes, text and the five standard &-entities for escaping (&lt; &gt; &amp; &quot; &apos;, standing for < > & " '). All in UTF-8. And maybe also HTML-compatible entities and comments, but they are already a bit annoying. Now the XML standard is way fatter and uglier than that. It contains:
  • DTDs. DTDs do not follow the XML syntax, and according to the standard, they can be dumped straight into any XML document and the program is supposed to handle all that. And they do more than just validation! They can set attribute default values, define text replacements and do a lot of other useless things. This is the worst thing about XML. Of course nobody actually dumps such things into documents; the most people do is a single (and ugly anyway) DOCTYPE declaration, as if we couldn't use MIME types for that.
  • Non-standard entities. What does &foo; mean? Well, it can mean anything. And it really sucks, because the program doesn't want to deal with escaping issues. So the program wants "AT&T", not "AT&amp;T". And what is the parser supposed to return when it gets "&foo; &amp; &bar;"?
  • CDATA - Yeah, let's provide a second and completely redundant way of escaping characters to make everyone's life harder.
  • XML declarations. These <?xml ... ?> things that can specify version and encoding. As if the standard couldn't simply say "XML documents are encoded in UTF-8".
  • Processing instructions. So now every program is supposed to somehow deal with <?mspaint ... ?> randomly splattered through the document. They don't even have to follow the tree structure, so where the heck is the parser supposed to attach them in the parse tree?
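To make two of the items above concrete, here is a rough Ruby sketch of one possible way to cope with them: decode the five standard entities and pass unknown ones like &foo; through untouched, and rewrite CDATA sections as ordinary escaped text. The helper names (lenient_decode, flatten_cdata) are made up for this illustration, and the pass-through rule is just one assumed heuristic; none of this is taken from magic/xml or REXML.

```ruby
# Hypothetical helpers for illustration only; not part of magic/xml or REXML.
STANDARD_ENTITIES = {
  "lt" => "<", "gt" => ">", "amp" => "&", "quot" => '"', "apos" => "'"
}.freeze

TEXT_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Replace the five standard entities in a single pass. Anything else that
# looks like a named entity (such as &foo;) is returned unchanged, which is
# an assumed heuristic, not something the XML standard prescribes.
def lenient_decode(text)
  text.gsub(/&(\w+);/) { |match| STANDARD_ENTITIES.fetch($1, match) }
end

# Rewrite each CDATA section as plain escaped text, so later stages
# only ever see one escaping mechanism.
def flatten_cdata(xml)
  xml.gsub(/<!\[CDATA\[(.*?)\]\]>/m) { $1.gsub(/[&<>]/, TEXT_ESCAPES) }
end

puts lenient_decode("&foo; &amp; &bar;")
# prints: &foo; & &bar;
puts flatten_cdata("<a><![CDATA[AT&T <rocks>]]></a>")
# prints: <a>AT&amp;T &lt;rocks&gt;</a>
```

Note that the output of lenient_decode is ambiguous: the program can no longer tell a literal "&foo;" from an undecoded entity, which is exactly why nonstandard entities are such a pain.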
What do they all have in common? They're rarely used, come from SGML, and they make the XML model ugly and complicated. Some of this cruft, like CDATA, can be gotten rid of during parsing. Other parts (like nonstandard entities) are simply screwed up beyond any repair, and we're better off ignoring the standard and using some heuristic that usually does the right thing.
Oh, and by the way, the new version of magic/xml is out there. It supports XML parsing using the parser from REXML. The generation part is already fairly magical, and supports any mix of functional, object-oriented and procedural styles of XML processing, so you can say:
node = XML.foo { bar!("Hello"); bar!({:color => "blue"}, "world") }
to get a node equivalent to <foo><bar>Hello</bar><bar color="blue">world</bar></foo>. Enjoy :-)
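For the curious, here is a minimal sketch of how a builder with this kind of syntax could be put together in plain Ruby, using method_missing and instance_eval. The TinyXML class and everything in it are hypothetical stand-ins written for this illustration; this is not how magic/xml is actually implemented, and it covers only a tiny slice of what the library does.

```ruby
# Hypothetical stand-in for illustration; NOT the magic/xml implementation.
class TinyXML
  ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

  attr_reader :name, :attrs, :children

  def initialize(name, *args, &block)
    @name = name.to_s
    # Hash arguments become attributes; everything else is child content.
    hashes, @children = args.partition { |arg| arg.is_a?(Hash) }
    @attrs = hashes.reduce({}, :merge)
    # Run the block with the new node as self, so bar!(...) lands here.
    instance_eval(&block) if block
  end

  # TinyXML.foo(...) builds a <foo> node.
  def self.method_missing(name, *args, &block)
    new(name, *args, &block)
  end

  # Inside a block, bar!(...) appends a <bar> child to the current node.
  def method_missing(name, *args, &block)
    return super unless name.to_s.end_with?("!")

    child = TinyXML.new(name.to_s.chomp("!"), *args, &block)
    @children << child
    child
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?("!") || super
  end

  def to_s
    attr_text = @attrs.map { |key, value| " #{key}=\"#{escape(value)}\"" }.join
    body = @children.map { |c| c.is_a?(TinyXML) ? c.to_s : escape(c) }.join
    "<#{@name}#{attr_text}>#{body}</#{@name}>"
  end

  private

  # Escape text and attribute values so the program never sees "&amp;".
  def escape(value)
    value.to_s.gsub(/[&<>"]/, ESCAPES)
  end
end

node = TinyXML.foo { bar!("Hello"); bar!({:color => "blue"}, "world") }
puts node
# prints: <foo><bar>Hello</bar><bar color="blue">world</bar></foo>
```

The design choice worth noticing is that escaping happens only at serialization time, so the program works with "AT&T" everywhere else.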
