Bruce Lawson's personal site

XML on the Web

I was asked by John Foliot (lovely bloke; I owe him some sightseeing and good single malt whisky) and Shelley Powers (never met her but am enjoying her book Painting The Web) to defend my statement that the rules of XML make no sense on the Web.

About three years ago, it was fashionable to argue about whose standards-willy was the longest. People mocked other people for using a transitional doctype because transitional was somehow just playing at xhtml. Another, more esoteric web standards cudgel was mocking those who served their xhtml content as text/html rather than proper XML.

I kept silent during those debates, as I was embarrassed by standards inadequacy. I used a transitional doctype and happily served my content as “tag soup” (as it was called by the longwillies) even to the browsers that could deal with XML.

Why? I used a transitional doctype until my HTML5 redesign this year because strict is too strict with user-generated content in comments. It’s impertinent to expect a commenter to know the rules of markup and—for example—use paragraphs inside a blockquote, and impossible to enforce. So why bother? Why risk invalidity or arse around with comment-sanitising plugins?

It’s a similar story with serving content as real XML rather than text/html. Firstly, Internet Explorer doesn’t understand it, so you have to do content negotiation. For me there seemed no reason to do that, as the browsers that can understand proper XML seemed not to do anything special with it.
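For anyone who hasn’t had the pleasure, content negotiation here means reading the browser’s Accept header and only sending the XML media type to browsers that ask for it. A minimal sketch in Python, assuming a hypothetical helper called choose_media_type (the name and the simplified header parsing are mine, not taken from any framework):

```python
def choose_media_type(accept_header: str) -> str:
    """Pick a Content-Type for an XHTML page from the request's Accept header.

    Hypothetical helper: returns the XML media type only when the client
    lists application/xhtml+xml explicitly with a non-zero q value.
    Everything else, including wildcards, gets text/html.
    """
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # q defaults to 1 when the parameter is absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unreadable q value as "not acceptable"
        if media_type == "application/xhtml+xml" and quality > 0:
            return "application/xhtml+xml"
    return "text/html"
```

A browser that sends a header listing application/xhtml+xml gets real XML; one that sends only image types and a wildcard falls back to text/html. That’s extra server-side machinery for every page, which is rather the point.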

That’s the problem with XML as a Web format. It’s all risk and no advantage. By risk, I mean the legendary draconian XML error handling. Simply put: at the first well-formedness error, the browser must stop parsing and report an error. Mozilla’s yellow screen of death is an example of this.

That’s intolerable. In my day job I used to work on a site with potentially thousands of authors, ranging from my team, who validated everything, to a lady who worked on Tuesdays and Wednesdays uploading job vacancies through a dinosaur CMS. It would be completely absurd for large swathes of information critical to our customers to break because of an unencoded ampersand in some third-party advertising content.
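To see how little it takes, here’s a rough sketch using Python’s standard library. The ad copy is made up, and html.parser is only standing in for a forgiving parser; it is not how any browser actually works.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical third-party ad copy containing a bare, unencoded ampersand.
AD_MARKUP = "<p>Fish & chips, two for one</p>"


def parses_as_xml(markup: str) -> bool:
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # The XML parser gives up at the first well-formedness error.
        return False


class TextCollector(HTMLParser):
    """Collects the text content seen by the forgiving HTML parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_via_html_parser(markup: str) -> str:
    """Return the text a tolerant HTML parser recovers from the markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

Here parses_as_xml(AD_MARKUP) is False, because the XML parser refuses the whole fragment, while text_via_html_parser(AD_MARKUP) still hands back the sentence, ampersand and all. Encode the ampersand as &amp;amp; and the XML parser is happy again.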

Any website that has user-generated content or non-geek authors cannot afford to risk being “real” XML, particularly when browsers have historically been tolerant of errors. (See Tim Berners-Lee’s announcement that “The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn’t work.”)

Now, for re-purposing content, XML is super. RSS feeds are a tremendous example of this. But who has ever hand-written an RSS feed?
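Feeds are generated by software, and software can do the escaping for you. A minimal sketch, assuming a hypothetical build_feed function of my own and only a bare-bones subset of RSS 2.0 elements:

```python
import xml.etree.ElementTree as ET


def build_feed(title: str, link: str, items: list[tuple[str, str]]) -> str:
    """Serialise a tiny RSS 2.0 feed; items are (title, link) pairs.

    Hypothetical helper: a real feed would carry more elements
    (per-item descriptions, dates, GUIDs) than this sketch does.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    # The serialiser escapes ampersands and angle brackets in text content.
    return ET.tostring(rss, encoding="unicode")
```

Pass in a title such as “Jobs & vacancies” and the output contains &amp;amp; without anyone having to remember the rule, which is why well-formedness is a reasonable thing to demand of a machine and an unreasonable thing to demand of a commenter.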

Primarily (and getting all touchy-feely on you) the Web isn’t a data transfer mechanism, it’s about communication. Forgiving browsers and liberal validation rules lower the barrier to entry to publishing on the Web.

Imagine you had a friend who spoke excellent English, but occasionally made small mistakes with grammar or pronunciation; would you put your hands over your ears and shout “La la la I can’t hear you!” until they corrected those errors? Of course you wouldn’t. So why would you do the same with the Web?

Personal opinion; nothing to do with my employer, wife or hamster.

Added 31 January 2011: an interesting article by James Clark called XML vs the Web:

XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.

58 Responses to “XML on the Web”

Comment by Rhyaniwyn

I don’t think I said “I really like draconian error handling.”

I said that the emphasis on validation and consistency within a logical framework, such as XML provides, allows different applications, written by different developers, of differing skill, from all over the web and world, to work with each other and with web content more easily: predictably. XML also provided exciting potential to extend HTML.

The time and resource sink caused by having to interpret ambiguous input is a severe detraction that can be lessened by “strict” — read “consistent” — syntax rules. I see the attraction of failure for that reason, but do not consider draconian error handling a positive thing for “usability” in any sense.

I never meant to imply that XML is the only way to accomplish those goals of extensibility, semantic richness, etc. I said simply that XML, aside from its shortcomings (which any specification does have), provided a consistent framework to accomplish those things.

My only argument regarding draconian error handling in XHTML is a reminder that error handling is not the entirety of the specification. Since it is instadeath, I can sympathize with “all risk, no gain.” But I feel XML compatible HTML had good ideas behind it; ideas I don’t want to see forgotten in the HTML5 furor.

It remains to be seen whether other specifications will offer what XML seemed to offer. Seemed certainly being a key word, yes.

HTML syntax kind of reminds me of the dress code at work. I read the whole thing once and thought, “Dammit, why don’t they just provide us uniforms?” It was 4 pages of, “Women may wear capri pants in the summer so long as they fall x inches below and x inches above the knee. Mens’ shorts that fall between the ankle and knee are prohibited. No shorts or skirts may be worn higher than 1 inch above the knee.” My MOM wasn’t that strict and she was pretty strict. I can’t figure out what kind of shorts men can wear, either, or if they just can’t wear any.

Comment by George Katsanos

I’m with Alexander Vlad on this one. Well said.
The HTML tag soup that’s all over the web, we “all” (I hope) agree we don’t like, so what’s the sudden 180-degree change about XML being “too strict”? I thought supporting validity and standards was what you’re doing for a living Bruce!
Or maybe I just didn’t get the purpose of this post. (what was it anyway?)

Comment by Bruce

@George “I thought supporting validity and standards was what you’re doing for a living Bruce!”

It is. I passionately advocate using open web standards, and ensuring that they are used according to the rules of the language.

That’s what professionals do.

I support the HTML5 drive to allow, as valid, different authoring conventions (trailing slashes or not, uppercase tags or not, quoted attributes or not) where the differences don’t matter in the real world.

I support the browsers that render the Web forgiving bad markup so that we can read ancient pages, pages authored by non-professionals or assembled by crappy dinosaur CMSs. I don’t support draconian error handling.

Take the font element; it’s wrong to author with it. Professionals know that. But browsers must render it.

Comment by Michael Kozakewich

I was taught in college how to create my own DTD and validate by it, so I feel lucky.

I think WYSIWYG editors have a responsibility to the internets to create a tool that manufactures well-formed code. This goes beyond standards or draconics — it’s moral responsibility, when you’re creating something for someone to use in different places.

At that point, the majority of people would be writing what would end up being well-formed code. I don’t think anyone (except me) nowadays writes with actual text.
Heck, one could create a Google Document and export it to html.

Essentially, the landscape of the web is changing, and we’re climbing up the stairs onto a thin layer of abstraction, where people don’t really need to see the code. We used to get “PASTE THIS INTO YOUR BLOG!!” quiz results; now we have an icon we press.
At this stage in time, well-formedness would be appreciated.

Comment by Dan

To me XML doesn’t belong on the web. That’s why XSL was invented – so you can transform XML into HTML or whatever, if you really must use XML to begin with.
