XML Bad Practices
A Schema Will Save You
Schemata (also known as schemas for the less pedantic) seem for some people to belong to an almost mythical dimension. Not only do many appear to believe that you need a schema to "define a namespace" (whatever that means), but even those bereft of that error altogether too often expect a schema to step in, cape and superpowers flapping in the wind, to fix parts of language design that they wish not to think about. This article is part of a series based on the paper "Designing XML/Web Languages: A Review of Common Mistakes" which I presented at the XML Prague 2009 conference.
Different schema languages have different features, but the issues are largely the same whether you expect a DTD's external subset to be processed, an xsi:schemaLocation to have an effect, or processors to universally ship with a built-in schema.
First and foremost, any advantage brought by schema validation applies primarily to documents that are valid. Integrating validation with lacunae values, error handling, or the ignoring of unknowns is therefore complex at best, though there could perhaps be value in having a common language with which to express such processing rules (some schema languages, such as Schematron and NVDL, support some of this, but none is expressly designed with versioning and error handling in mind).
It can be tempting to provide default values through an external schema rather than, or in addition to, through the specification. After all, even if the specification defines lacunae values, it would be nice if generic XML processors could also benefit from that information. This is generally useless, and occasionally harmful. It is useless because, as explained above, a lacuna value is different from, and more powerful than, a default value. Specifying default values for generic processors will therefore lead to a mix-up in which the values are sometimes right but not always.
It can be harmful if, as was done at one point in SVG, the external subset is used to default namespace declarations. That will lead to elements that are in a different namespace depending on whether the external subset was processed, which is not only optional but, in the case of Web technologies, rare. This approach further required authors of documents that used a namespace prefix for SVG to declare it in the internal subset.
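The effect can be reproduced with a short sketch. It assumes Python's standard xml.etree.ElementTree parser, which reads the internal subset but does not fetch external subsets; the DTD URL and the root_tag helper are hypothetical and only serve the illustration. The internal subset stands in for "the DTD was processed", the external reference for "the DTD was not processed".

```python
import xml.etree.ElementTree as ET

# The namespace is supplied only by the DTD. It is placed in the internal
# subset here so that the parser is certain to see the declaration.
WITH_INTERNAL_SUBSET = (
    '<!DOCTYPE svg [\n'
    '  <!ATTLIST svg xmlns CDATA #FIXED "http://www.w3.org/2000/svg">\n'
    ']>\n'
    '<svg/>'
)

# The same document, but the defaulting is left to an external subset
# (hypothetical URL) that this parser never retrieves.
WITH_EXTERNAL_SUBSET = (
    '<!DOCTYPE svg SYSTEM "http://example.org/hypothetical-svg.dtd">\n'
    '<svg/>'
)

def root_tag(document: str) -> str:
    """Return the root element name in ElementTree's {namespace}local form."""
    return ET.fromstring(document).tag

print(root_tag(WITH_INTERNAL_SUBSET))
print(root_tag(WITH_EXTERNAL_SUBSET))
```

Under these assumptions the first root element is reported in the SVG namespace and the second in no namespace at all, even though the two element tags are character-for-character identical.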
Another such reliance on the external subset that will cause no end of trouble is to expect it to define entities. The typical example here is XHTML, which will regularly trip parsers that do not fetch the external subset and subsequently complain about undefined "nbsp" entities.
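As a rough illustration of that failure, the sketch below again assumes Python's xml.etree.ElementTree, which does not fetch the external subset; the parses helper is a hypothetical name introduced for the example. A numeric character reference is shown alongside because it needs no declaration at all.

```python
import xml.etree.ElementTree as ET

XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
)

def parses(document: str) -> bool:
    """Report whether the document parses without a ParseError."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True

# "nbsp" is declared only in the external subset, which is never read.
named = (
    XHTML_DOCTYPE
    + '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>a&nbsp;b</p></body></html>'
)

# A numeric character reference does not depend on any declaration.
numeric = (
    XHTML_DOCTYPE
    + '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>a&#160;b</p></body></html>'
)

print(parses(named))
print(parses(numeric))
```

With this parser the first document is rejected for its undefined entity while the second is accepted, which is why content that must survive generic processors is safer with numeric references than with entities defined in an external subset.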
A schema can be useful for documentation purposes, and some tools can use one to provide authoring support. But unless you have a very strict workflow it will be essentially useless, and it certainly won't help your design (especially if you have to change your language to fit one schema language's idiosyncratic limitations — you should never have to design around the Unique Particle Attribution rule unless you can intelligibly explain what it is). To paraphrase the adage: a user had a problem and decided to address it using a schema; now she has two problems.