Robin Berjon

XML Bad Practices

Human-Readable Text in Attributes

It is often tempting to place text intended for humans inside an attribute, perhaps to "attach" it more directly to the element, or to make authoring terser. The archetypal example is the alt attribute of an img element, which holds a textual description of the image.

The issue is that this approach breaks down as soon as the string needs internal structure. For instance, if DocBook had chosen a title attribute on the section element instead of a title element to specify section titles, it would be impossible to have the title be "The Foo interface" with Foo marked up as code, because attribute values cannot contain elements.
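The limitation can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the code element wrapped around "Foo" and the exact shape of both documents are assumptions made for illustration, not actual DocBook.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute-based design: markup placed inside the attribute value.
ATTRIBUTE_FORM = '<section title="The <code>Foo</code> interface"/>'

# Element-based design: the title is a child element and may contain markup.
ELEMENT_FORM = (
    "<section>"
    "<title>The <code>Foo</code> interface</title>"
    "</section>"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A literal "<" is not allowed in an attribute value, so this fails to parse.
attribute_ok = is_well_formed(ATTRIBUTE_FORM)

# The element form parses, and the inline structure survives.
element_ok = is_well_formed(ELEMENT_FORM)
title = ET.fromstring(ELEMENT_FORM).find("title")
code_text = title.find("code").text

print(attribute_ok, element_ok, code_text)
```

Escaping the markup (writing &lt;code&gt; in the attribute) would make the document well-formed again, but the value would then be a flat string that merely looks like markup, with no structure a processor can rely on.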

That might seem like an acceptable limitation, but it gets worse: if the text may be in any language, there will be cases in which it requires structure. For instance, some Chinese or Japanese text requires what are known as ruby annotations (basically text that is rendered above or to the right of the primary text to indicate its pronunciation). Similarly, it can be useful to specify the writing direction when mixing languages that run in different directions (e.g. Arabic and French). Also, due to limitations in Unicode, some characters will not render correctly (i.e. will be rendered with the wrong glyphs) unless you specify which language the text is in. That is notably the case for the CJK set, in which Unicode gave some Chinese, Japanese, and Korean characters the same code point even though they are drawn differently in each language (a rather scandalous decision, but one we have to live with). For this case one could place a lang attribute on the element to get the right effect on the text inside the attribute, but that would set the language of the entire element, not just of the text in the attribute. It's an extreme case, but if one had an img element pointing to an image of a wine label from France, with an alt attribute in Korean, and set lang to ko so that the alt renders correctly, then the language of the label would be declared to be Korean too.
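The language-scoping problem in the wine label case can be sketched as follows. This is only an illustration under stated assumptions: it uses xml:lang rather than a plain lang attribute, a hypothetical alt child element, a made-up file name, and ko, the ISO 639-1 code for Korean. The helper simply applies the usual inheritance rule, in which an element without its own language declaration takes that of its parent.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Attribute form: one language declaration has to cover the whole element.
ATTRIBUTE_FORM = '<img src="label.jpg" alt="프랑스 와인 라벨" xml:lang="ko"/>'

# Element form (hypothetical alt child): each part declares its own language.
ELEMENT_FORM = (
    '<img src="label.jpg" xml:lang="fr">'
    '<alt xml:lang="ko">프랑스 와인 라벨</alt>'
    "</img>"
)


def effective_languages(elem, inherited=None, out=None):
    """Map each element tag to the language in scope for that element."""
    if out is None:
        out = {}
    # An element's own xml:lang wins; otherwise the parent's value is inherited.
    lang = elem.get(XML_LANG, inherited)
    out[elem.tag] = lang
    for child in elem:
        effective_languages(child, lang, out)
    return out


attribute_langs = effective_languages(ET.fromstring(ATTRIBUTE_FORM))
element_langs = effective_languages(ET.fromstring(ELEMENT_FORM))

print(attribute_langs)  # the French label is declared Korean
print(element_langs)    # label and description carry distinct languages
```

In the attribute form there is nowhere to hang a second declaration, so the image itself ends up labelled as Korean; in the element form the description carries its own language and the label keeps French.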

Given the technicalities involved in getting I18N right, and given the greater extensibility of the element approach, it follows that text intended for human consumption should occur only in element content. That being said, the original argument, terseness, has some merit for authors. Where terseness is desired, it is possible to define a two-tiered approach in which such text can occur in either an attribute or a child element (with the child taking precedence). That approach has drawbacks, however, and needs to be used with caution.
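One way a consumer might implement the two-tiered lookup is sketched below. The function name human_text and the sample section documents are hypothetical, and flattening the child to plain text with itertext() is a simplification: a real processor would more likely keep the child's markup.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def human_text(elem: ET.Element, name: str) -> Optional[str]:
    """Return the human-readable text called `name` for `elem`.

    A child element named `name` takes precedence over an attribute of the
    same name; None is returned when neither is present.
    """
    child = elem.find(name)
    if child is not None:
        # itertext() concatenates the text of the child and its descendants.
        return "".join(child.itertext())
    return elem.get(name)


# Terse authoring: only the attribute is present.
terse = ET.fromstring('<section title="Introduction"/>')

# Both forms present: the child element wins over the attribute.
rich = ET.fromstring(
    '<section title="Ignored attribute title">'
    "<title>The <code>Foo</code> interface</title>"
    "</section>"
)

# Neither form present.
bare = ET.fromstring("<section/>")

terse_title = human_text(terse, "title")
rich_title = human_text(rich, "title")
bare_title = human_text(bare, "title")

print(terse_title, rich_title, bare_title)
```

One evident cost of this design is that every tool consuming the vocabulary has to apply the same precedence rule, and a document carrying both forms invites the two to drift apart.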

This article is part of a series on XML Bad Practices.