Robin Berjon

Smothered in Hats

The Death of Draco

Sometimes words have interesting origins. According to his Wikipedia entry, Draco was a 7th century BCE Athenian legislator who replaced the system of oral law and blood feud with a written code, posted clearly in public so that none could ignore it. By our lily-livered modern criteria, his laws are deemed harsh because the few offences that didn't call for the death penalty were punished by enslaving the offender. But putting his deeds in context, one has to admit that a written, shared law that is not subject to arbitrary interpretation and the whims of elders is very much a progressive step.

XML, as we all know, follows draconian rules.

We have a similar situation with HTML and XML. I remember when the first rumours of XML started percolating about, and why I jumped on that particular bandwagon. Back then, I was exchanging data using an HTML-like syntax essentially defined as "whatever Perl's HTML::Parser gobbles up in the right way". The specifications were basically a bunch of examples. It worked great, a lot of the time. But once in a while some third party would use another parser, or an ancient version of the same, or even very much the same but with processing rules that didn't quite match, or would send documents that were just slightly off in very much the wrong way. And then I'd have to pull an all-nighter to figure out what went wrong instead of having beer with my friends. No one should be pried away from beer. Why didn't we use SGML? If you have to ask, you've probably either never used it, or have used it so well that you wouldn't understand what we didn't understand.

XML turned all of that into a pretty song. It was simple enough, and if you glossed over encodings, optional namespace support, the external subset, and a few other such nits (which most of the time you could) it pretty much worked. Which is to say, you had interoperability.

Back in those days, this newfound and much cherished interoperability was attributed in large part to XML's draconian error handling. Unlike HTML parsers, which all did whatever they could with the content in possibly arbitrary and sometimes rather scary ways, you knew where you stood. You read the document? Then it's good. Bad document? Boom.

The problem is, that's not where the interoperability came from. Some smart minds did see that draconian processing and well-defined processing were different things, but for most of us unparsed masses those were one and the same. It is said that Draco was so praised by his contemporaries that in a massive show of approval apparently traditional in Ancient Greece, his fans threw so many hats and shirts and cloaks on his head that he suffocated, and was buried. I am not necessarily implying that there's a lesson in there. But it's tempting.

In the meantime, HTML has come a long way. It now benefits from a well-defined parsing algorithm that can guarantee interoperability. It is, however, not draconian. In that light, it's worth taking another look at whether that halt-and-catch-fire approach was all that good. Given an XML parser and an HTML parser, if you feed them any input you get a predictable output. But for the XML parser some of these outputs will be empty, whereas for the HTML parser there will always be something. Note that I am aware of the sophistry that claims an XML processor does not have to catch fire but may indeed do something else, so long as it isn't claimed that an XML document was involved. That may be fine, but in the absence of a definition of what it might do, it returns us to uncharted waters.
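To make that contrast concrete, here is a minimal sketch in Python (my choice of language and libraries, not anything prescribed above): the same slightly broken input fed to a strict XML parser and to a tolerant HTML parser. Python's html.parser is merely forgiving rather than a full implementation of the HTML parsing algorithm, but it shows the shape of the difference.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

broken = "<p>an unclosed paragraph <b>with bad nesting</p></b>"

# Draconian: one well-formedness error and the whole document evaporates.
try:
    ET.fromstring(broken)
except ET.ParseError as err:
    print("XML parser: no document at all --", err)

# Tolerant: the HTML parser keeps going and always hands back *something*.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(broken)
print("HTML parser: recovered tags:", collector.tags)  # ['p', 'b']
```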

Does that matter? I don't claim to have a final answer. There are issues concerning streaming that could warrant investigation. When using an XML parser, actions taken based on the stream of parse events ought to be transactional, since you never know when the document might prove not to be well-formed. There are similar issues with HTML (reparenting comes to mind) but they are less drastic. Paradoxically, a sick, sick mind could claim that this makes HTML a better encoding for SOAP messages. SOAP5 and WSDL5 anyone?
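A hedged sketch of that streaming hazard, again in Python with made-up data: events stream out of the parser, you act on them as they arrive, and only later does the document turn out not to be well-formed, at which point everything you did has to be undone.

```python
import io
import xml.etree.ElementTree as ET

# Two good records followed by a well-formedness error near the end.
stream = io.BytesIO(b"<orders><order id='1'/><order id='2'/><oops></orders>")

processed = []  # stand-in for side effects you would have to roll back
try:
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "order":
            processed.append(elem.get("id"))  # acting on each record as it arrives
except ET.ParseError as err:
    # Strictly speaking no XML document ever existed, so whatever was done
    # with those earlier events has to be undone. Hence: transactions.
    print("rolling back", processed, "because:", err)
```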

Okay, okay, sorry. It's Friday night.

To me the question boils down to involuntary mistakes. It is often claimed that $somePercentage of mistakes made when producing XML should lead to something catching fire because they really are mistakes. I am always suspicious of such arguments, if only because they sound a lot like those used to claim that Java is a decent, modern programming language whose type system catches a lot of errors.

The problem with catching fire and producing an empty information set is that it doesn't put the recipient in control of what should be considered an error and what shouldn't. In HTML if you have an error you'll get a DOM, but it won't be the DOM that you expected. This can be tested for. I wonder, and this is very much an open question, if the sort of error handling that is desirable here wouldn't be better served using (possibly improved) validation technology rather than hardcoded catch-fire rules.
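As a rough illustration of what "the recipient decides" might look like, here is a toy check in Python, with an invented required_tags rule standing in for real (possibly improved) validation technology: the parse itself never fails, and the recipient tests the resulting structure against its own expectations.

```python
from html.parser import HTMLParser

class Outline(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)

def acceptable(document, required_tags=frozenset({"html", "head", "title", "body"})):
    outline = Outline()
    outline.feed(document)
    # The parse never catches fire; this check is where the recipient
    # decides what counts as an error.
    return required_tags <= outline.seen

print(acceptable("<html><head><tittle>oops</head><body></body></html>"))       # False
print(acceptable("<html><head><title>ok</title></head><body></body></html>"))  # True
```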

Draco's rules were an improvement. We might just wish to tone down the death penalty obsession.