Real Semantic Markup

Roger Johansson over at 456 Berea Street, reflecting on a series of articles by John Allsopp regarding HTML semantics, asks the question: “Should there be another way of extending and improving the semantics of HTML without requiring the specification to be updated?”

Personally, I think the issue revolves around the misuse of HTML to mark up something other than research papers.

It is my understanding that HTML is an application of SGML, a markup language used to mark up research papers for mass reproduction on offset printers. As such, the vocabulary (the tags) in HTML reflects the type of data being marked up. Consequently, when HTML is used to mark up documents that are not academic in nature (that is, not research papers), authors are left cobbling together solutions to retain the semantic value, and that rarely works. For example, if you want to mark up a mathematical equation, you’ll need the MathML specification precisely because HTML doesn’t have the vocabulary necessary for describing the content.

I find it a little ironic that Tim Berners-Lee has basically turned everyone into an academic in some sense, by enabling them to do massive research and post their findings. However, current technology limits us to “browsing” research papers, even though we’ve creatively found ways to publish much, much more than that.

I think the world is missing a browser that is able to render a variety of markup languages (vocabularies), including HTML, MathML, XHTML, XHTML2, XForms, SMIL, and others (although the last two describe forms and multimedia presentations rather than documents). I can imagine a world in which marketers define their own markup specification for safely sharing data (a problem I think microformats are trying to solve). In fact, markup languages can be defined for nearly any field. The problem is that we don’t have web browsers capable of rendering the data in the source documents in any meaningful fashion, because no formatting information is associated with any of the elements of these foreign markup languages. In fact, I find it hard to imagine what a marketing database or recipe list would look like if not some kind of document.
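To make the formatting problem concrete, here is a minimal Python sketch, not a real browser design. The recipe vocabulary, the `RULES` table, and the `render` function are all invented for illustration. It shows that elements of a custom vocabulary only become presentable once someone supplies formatting rules from outside the markup, and that without them all a renderer can do is fall back to raw text.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe vocabulary; these element names are invented for illustration.
RECIPE_XML = """
<recipe>
  <title>Pancakes</title>
  <ingredient>2 cups flour</ingredient>
  <ingredient>1 egg</ingredient>
  <step>Mix the ingredients.</step>
</recipe>
"""

# Formatting rules supplied from outside the markup; nothing in the
# source document implies them.
RULES = {
    "title": lambda text: text.upper(),
    "ingredient": lambda text: "  * " + text,
    "step": lambda text: "  > " + text,
}


def render(xml_source, rules):
    """Return one display line per child element of the root."""
    root = ET.fromstring(xml_source)
    lines = []
    for child in root:
        text = (child.text or "").strip()
        rule = rules.get(child.tag)
        # Elements without a rule fall back to their raw text, which is
        # roughly all a renderer could do with an unknown vocabulary.
        lines.append(rule(text) if rule else text)
    return lines


if __name__ == "__main__":
    print("\n".join(render(RECIPE_XML, RULES)))
```

Calling `render(RECIPE_XML, {})` with an empty rule table returns the same text with no structure at all, which is the situation I am describing for foreign markup languages in today’s browsers.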

So, in conclusion, I’m not sure if I’ve made my point, but basically I think any semantic improvements in HTML will come from focusing on the domain it was originally intended for (academia) rather than from trying to extend it to other domains that have little or nothing to do with writing research papers.


  1. Hi Gustave,

    I think the genie of HTML becoming a universal markup language is well and truly out of the bottle, and I doubt it will go back in any time soon. In fact, that was one of the key motivators for my articles.

    While in an ideal world, XML plus perfect browsers may well provide the ideal solution, I think the last 15 years demonstrate very clearly that the web of HTML and the kind of browsers we have now will be with us for a very long time to come. It’s more frustrating, but ultimately more worthwhile, to think about how we improve that world, well, at least in my opinion 😉

    thanks for the thoughtful comments


  2. You’re definitely right about the genie being out of the bottle, no question about it. I guess I was just hoping to enlighten some of my regular readers, many of whom are probably unfamiliar with the origins of HTML.

    I hope my comments didn’t offend. I am extremely grateful for your articles on semantic HTML. They’re an excellent resource everyone should read (I printed out all three for easier consumption).

    Thanks for your input and for your work on HTML!

