I was asked by John Foliot (lovely bloke; I owe him some sightseeing and good single malt whisky) and Shelley Powers (never met her but am enjoying her book Painting The Web) to defend my statement that the rules of XML make no sense on the Web.
About three years ago, it was fashionable to argue about whose standards-willy was the longest. People mocked other people for using a transitional doctype because transitional was somehow just playing at xhtml. Another, more esoteric web standards cudgel was mocking those who served their xhtml content as text/html rather than proper XML.
I kept silent during those debates, as I was embarrassed by standards inadequacy. I used a transitional doctype and happily served my content as “tag soup” (as it was called by the longwillies) even to the browsers that could deal with XML.
Why? I used a transitional doctype until my HTML5 redesign this year because strict is too strict with user-generated content in comments. It’s impertinent to expect a commenter to know the rules of markup and—for example—use paragraphs inside a blockquote, and impossible to enforce. So why bother? Why risk invalidity or arse around with comment-sanitising plugins?
It’s a similar story with serving content as real XML rather than text/html. Firstly, Internet Explorer doesn’t understand it, so you have to do content negotiation. For me there seemed no reason to do that, as the browsers that can understand proper XML seemed not to do anything special with it.
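For anyone who never had to bother, content negotiation boils down to peeking at the browser's Accept header and only sending the XML media type to clients that ask for it. Here's a minimal sketch in Python; the function name `choose_media_type` is my own invention for illustration, and it deliberately ignores wildcards and relative q-value ranking, which a real server would want to handle.

```python
def choose_media_type(accept_header: str) -> str:
    """Pick a Content-Type for an XHTML page based on the Accept header.

    Hypothetical helper: returns the XML media type only when the client
    explicitly lists application/xhtml+xml with a q-value above zero.
    Everything else (including browsers that only send */*) gets text/html.
    """
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if media_type != "application/xhtml+xml":
            continue

        q = 1.0  # a media range without an explicit q parameter defaults to 1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"

        if q > 0:
            return "application/xhtml+xml"

    return "text/html"
```

A browser that sends `text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8` would get the XML media type; one that sends only `image/gif, image/jpeg, */*` falls back to text/html. That's extra moving parts on every request, for no visible gain.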
That’s the problem with XML as a Web format. It’s all risk and no advantage. By risk, I mean the legendary draconian XML error handling. Simply put: at the first well-formedness error, the browser must stop parsing and report an error. Mozilla’s yellow screen of death is an example of this.
That’s intolerable. In my day job I used to work on a site with potentially thousands of authors, ranging from my team, who validated everything, to a lady who worked on Tuesdays and Wednesdays uploading job vacancies through a dinosaur CMS. It would be completely absurd for large swathes of information critical to our customers to break because of an unencoded ampersand in some third-party advertising content.
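To see how little it takes, here's a rough sketch using Python's standard library. The snippet of advertising copy is made up, and Python's `html.parser` is only standing in for a forgiving browser (it is not the parsing algorithm browsers actually use), but the contrast is the same: the XML parser refuses the whole fragment, while the lenient one shrugs and carries on.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical third-party ad copy containing an unencoded ampersand.
AD_SNIPPET = "<p>Fish & chips, half price today</p>"


class TextCollector(HTMLParser):
    """Lenient parser that just gathers whatever text it finds."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Draconian handling: one well-formedness error rejects everything.
        return False


def text_via_lenient_parser(markup: str) -> str:
    """Return the text content recovered by the forgiving HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    print("Well-formed XML?", parses_as_xml(AD_SNIPPET))
    print("Lenient parser recovered:", text_via_lenient_parser(AD_SNIPPET))
```

Swap the `&` for `&amp;` and the XML parser is happy again. The trouble is that on a site with thousands of authors, somebody, somewhere, will always forget.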
Any website that has user-generated content or non-geek authors cannot afford to risk being “real” XML, particularly when browsers have historically been tolerant of errors. (See Tim Berners-Lee’s announcement that “The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn’t work.”)
Now, for re-purposing content, XML is super. RSS feeds are a tremendous example of this. But who has ever hand-written an RSS feed?
Primarily (and getting all touchy-feely on you) the Web isn’t a data transfer mechanism, it’s about communication. Forgiving browsers and liberal validation rules lower the barrier to entry to publishing on the Web.
Imagine you had a friend who spoke excellent English, but occasionally made small mistakes with grammar or pronunciation; would you put your hands over your ears and shout “La la la I can’t hear you!” until they corrected those errors? Of course you wouldn’t. So why would you do the same with the Web?
Personal opinion; nothing to do with my employer, wife or hamster.
Added 31 January 2011: an interesting article by James Clark called XML vs the Web:
“XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.”