XML on the Web

I was asked by John Foliot (lovely bloke; I owe him some sightseeing and good single malt whisky) and Shelley Powers (never met her but am enjoying her book Painting The Web) to defend my statement that the rules of XML make no sense on the Web.

About three years ago, it was fashionable to argue about whose standards-willy was the longest. People mocked other people for using a transitional doctype because transitional was somehow just playing at XHTML. Another, more esoteric web standards cudgel was mocking those who served their XHTML content as text/html rather than proper XML.

I kept silent during those debates, as I was embarrassed by my own standards inadequacy. I used a transitional doctype and happily served my content as “tag soup” (as it was called by the longwillies) even to the browsers that could deal with XML.

Why? I used a transitional doctype until my HTML5 redesign this year because strict is too strict with user-generated content in comments. It’s impertinent to expect a commenter to know the rules of markup and—for example—use paragraphs inside a blockquote, and impossible to enforce. So why bother? Why risk invalidity or arse around with comment-sanitising plugins?
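To illustrate: the first snippet below is exactly what a normal commenter types, and XHTML 1.0 Strict rejects it, because blockquote may only contain block-level elements; only the second validates.

```html
<!-- What a commenter naturally writes: invalid in XHTML 1.0 Strict,
     because blockquote may not contain bare inline content. -->
<blockquote>Great post, Bruce!</blockquote>

<!-- The Strict-valid version: the quoted text wrapped in a paragraph. -->
<blockquote><p>Great post, Bruce!</p></blockquote>
```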

It’s a similar story with serving content as real XML rather than text/html. Firstly, Internet Explorer doesn’t understand it, so you have to do content negotiation. For me there seemed no reason to do that, as the browsers that can understand proper XML seemed not to do anything special with it.
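For anyone wondering what that negotiation involves, here is a rough sketch in PHP (just the usual Accept-header check; a production version would also weigh the header’s q-values):

```php
<?php
// Rough sketch of XHTML content negotiation: browsers that advertise
// support for application/xhtml+xml get real XML; everything else
// (notably Internet Explorer) gets text/html.
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';

if (strpos($accept, 'application/xhtml+xml') !== false) {
    header('Content-Type: application/xhtml+xml; charset=utf-8');
} else {
    header('Content-Type: text/html; charset=utf-8');
}
?>
```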

That’s the problem with XML as a Web format. It’s all risk and no advantage. By risk, I mean the legendary draconian XML error handling. Simply put: at the first well-formedness error, the browser must stop parsing and report an error. Mozilla’s yellow screen of death is an example of this.

That’s intolerable. In my day job I used to work on a site with potentially thousands of authors, ranging from my team, who validated everything, to a lady who worked on Tuesdays and Wednesdays uploading job vacancies through a dinosaur CMS. It would be completely absurd for large swathes of information critical to our customers to break because of an unencoded ampersand in some third-party advertising content.
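To make the failure mode concrete: served as application/xhtml+xml, the first snippet below (the URL is made up) takes down the whole page; only the second is well-formed.

```html
<!-- A single unescaped ampersand in a URL is a well-formedness error:
     an XML parser must stop dead when it reaches it. -->
<p><a href="jobs.php?page=2&sort=date">More vacancies</a></p>

<!-- The well-formed version escapes the ampersand as an entity. -->
<p><a href="jobs.php?page=2&amp;sort=date">More vacancies</a></p>
```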

Any website that has user-generated content or non-geek authors cannot afford to risk being “real” XML, particularly when browsers have historically been tolerant of errors. (See Tim Berners-Lee’s announcement that “The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn’t work.”)

Now, for re-purposing content, XML is super. RSS feeds are a tremendous example of this. But who has ever hand-written an RSS feed?

Primarily (and getting all touchy-feely on you) the Web isn’t a data transfer mechanism, it’s about communication. Forgiving browsers and liberal validation rules lower the barrier to entry to publishing on the Web.

Imagine you had a friend who spoke excellent English, but occasionally made small mistakes with grammar or pronunciation; would you put your hands over your ears and shout “La la la I can’t hear you!” until they corrected those errors? Of course you wouldn’t. So why would you do the same with the Web?

Personal opinion; nothing to do with my employer, wife or hamster.

Added 31 January 2011: an interesting article by James Clark called XML vs the Web:

XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.

58 Responses to “XML on the Web”

Comment by Geoffrey Sneddon

You don’t have to show a yellow screen of death upon a fatal error in XML, you merely have to stop processing and give the user an error. WebKit does this and renders everything it has thus far parsed.

However, I take exception to the example of RSS: a huge number of RSS feeds are not XML, and if treated as such, will cause fatal errors. Admittedly, the number one cause of this is the document not being valid according to the character encoding used (which is a fatal error in XML-land), then followed by unescaped ampersands (often within URIs).

Comment by Bruce

Snedders – my bad; I meant to say the yellow screen of death was an example of draconian parsing; have amended that now.

Didn’t know that about RSS; they were counter-cited as an example of XML working properly on the Web, so I assumed most of them were XML and vaguely wondered how they could be clean when the source HTML wasn’t. (I never stopped to investigate; feeds are like my car stereo to me: I’m glad it’s there but don’t give a shit how it works.)

Comment by Richard Conyard

Bruce,
Reading the above it would appear your chief problems with XML fall under two points:
1) XML error handling
2) Education for authors when rolling their own HTML

I’m with you on point one, it is far too draconian, especially since automated routines can be employed to fix these errors on the fly (how often have browsers attempted to second-guess bad HTML on websites?).

Point two I’m with you in part, you cannot expect editors and authors to have an in depth knowledge of data transfer standards / mechanisms. You can expect authoring tools to support this though (both offline and CMS), which would remove a large part of the knowledge gap (especially when combined with less draconian error handling).

I agree with your argument about risk, but I cannot agree with your statement that there is no advantage. Combined with other technologies, XML can deliver great advantages to the web (although not always in the user agent); a few that spring to mind are GRDDL, Microformats (from my understanding most readers convert to XML and then use XPath to read them) and REST (after all, why not make web page API calls readable for machines and humans?). It’s a technology that is nothing special by itself; what it allows you to get done quickly and simply thereafter is where the benefits come in.

Comment by Sarven Capadisli

I agree with you that draconian error handling is fundamentally wrong for some things. Perhaps the Web is one of them.

I wrote a quick article comparing human and machine error handling when it comes to communication, which you might find interesting: http://csarven.ca/human-language-and-html

There is one thing that bugs me though. How come we are not as tolerant of server-side/application code? If we write gibberish code in any given language, we don’t expect it to compile or run, or even have the machine make something out of it. The machine is fairly dumb and it works within the given constraints.

I think XML in and of itself is a good thing and it is up to the interpreter to treat it however it sees fit. For instance, if a Web page allows user comments to be submitted and displayed, and there is no solid mechanism to scrape that input, perhaps the author needs to flag that and say ‘please don’t be so strict when you try to make sense of this document’.

Comment by Bruce

On Twitter, @rits said

replies mention GRDDL. I guess I’d first have to grok why RDF is useful before I can grok what GRDDL can do for the web

I’m with him there (although I prefer not to derail the conversation into talking of RDFa rather than XML)

Richard Conyard mentioned microformats. As far as I know, microformats work with any flavour of HTML, whether there’s an “X” in front of it or not. Am I barking at the wrong end of the stick?

Comment by Richard Conyard

@Jeremy @Nick I know that, I was referring to the tools that read microformats. It was my understanding that these normally use the HTMLTidy service to convert to XHTML (XML), before using XPath to read out each of the microformats from the page.

Comment by Rob

I have strong issues with those who feel the problem with XML, or any web standard, is that they want to be able to write code containing errors and get away with it with no consequences.

Comment by Nick Fitzsimons

@Richard: that’s one approach, and I wouldn’t be surprised to find it’s the most common. I’d probably use the general approach of parsing the content to a DOM, then using XPath to extract data from that.

As to whether there is any necessity for the source to pass through a state of being that could be described as XML, well, that depends on the parser one uses to get to the DOM stage. A tag soup parser could equally well be used, assuming it wasn’t prone to generating a malformed DOM (e.g. it guaranteed its output to be acyclic).
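To sketch that approach in PHP (illustrative only: the page URL is invented and the hCard “fn” class is just an example microformat property):

```php
<?php
// Illustrative sketch: parse tag soup into a DOM with a forgiving parser,
// then query it with XPath. No well-formed-XML stage is required.
$html = file_get_contents('http://example.com/');   // hypothetical page

$doc = new DOMDocument();
libxml_use_internal_errors(true);    // suppress warnings about broken markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Pull out every element carrying the hCard "fn" class, as an example.
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " fn ")]';
foreach ($xpath->query($query) as $node) {
    echo trim($node->textContent), "\n";
}
?>
```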

In addition, using a tag soup parser that generated SAX events in a coherent manner would allow one to inspect the stream of events for relevant data and avoid the DOM generation stage. Can a stream of SAX events be seen as being in some sense an XML document? It’s an interesting philosophical question ;-)

Comment by John Foliot

should have language must rules grammer good no? confusing understanding tag soup sometimes rules lax perhaps? bar raise the across board – draconian good not agree – loosey goosey answer not either! open + close not hard. & = fail &amp = good silly – agree

Forutsigbare regler bety bedre forståelse (predictable rules mean better understanding)

(Scotch for will come some day)

Comment by Bruce

@nick said

As to whether there is any necessity for the source to pass through a state of being that could be described as XML, well, that depends on the parser one uses to get to the DOM stage. A tag soup parser could equally well be used, assuming it wasn’t prone to generating a malformed DOM

That’s one of the great advantages of HTML5, of course: it attempts to ensure a predictable DOM in all circumstances.

Comment by Tantek

Bruce,

As others have noted, microformats work in pretty much any flavor of HTML. More specifically, all versions of HTML 3 and later, not just the majority of web pages today that are written in HTML 4.x and XHTML 1.x. Here are some pages with more info:
* http://microformats.org/wiki/html3
* http://microformats.org/wiki/html5

In addition, the vocabularies defined by microformats can be used in HTML5’s microdata feature; stay tuned for microformats spec updates that provide details and examples accordingly.

Tantek

P.S. I very much agree with the points of your post. Draconian error handling is dead on the Web, a failed experiment. A subtler form of this lesson is the “no required properties” principle that we’ve learned in the microformats community over the past year, from experience with seeing how and what authors do with formats when publishing. Again, expect microformats spec updates to incorporate this new principle.

Comment by Vlad Alexander

You wrote: “Forgiving browsers and liberal validation rules lower the barrier to entry to publishing on the Web.” Great! Wonderful thing! Now, what are the consequences of this?

When you don’t require authors to write to specification, authors will take the path of least resistance and generate garbage markup, which in all likelihood will not be accessible. The reason we don’t have an accessible Web today is because we make it too easy to publish on the Web by not enforcing standards. So you are making it easier for one group of people to use the Web at the expense of another group.

Comment by Bruce

Hi Vlad

I’m not sure I understand your argument. I believe that people should write to specification; I’m not suggesting otherwise.

But, given that 95.87% of the Web doesn’t validate, it is folly to suggest that the vast majority of the Web shouldn’t render (or stop rendering at the first error) – which is how XML error handling sees it.

There is no correlation between accessibility and validation. It is simple to make a valid site that is completely inaccessible; there are many, many sites that are perfectly accessible but have validity errors.

You say “The reason we don’t have an accessible Web today is because we make it too easy to publish on the Web”.

I say “The reason we have the Web we have today is because it is easy to publish on the Web”.

Failing on an un-encoded ampersand makes it much, much harder, but that has no bearing on accessibility at all.

The barrier to accessibility is two-fold, in my opinion; accessibility is relatively hard to do as it requires “bolting on” extra information in markup, and the society-wide problem that people do not care enough about people with disabilities.

Comment by Shelley

First, thanks for the post, Bruce.

Regarding your comment:

“But, given that 95.87% of the Web doesn’t validate, it is folly to suggest that the vast majority of the Web shouldn’t render (or stop rendering at the first error) – which is how XML error handling sees it.”

That’s actually not true.

XML Error handling will fail when certain syntactic constructs are not followed, including closing quotes, not using certain HTML named entities and so on, but you can easily have a web page that is valid XML markup, but still invalid XHTML.

Validity has nothing to do with XML processing.

Frankly, the requirements for XML are no different than the requirements for PHP, which is the server-side language that runs most of our web sites. If the language rules for PHP are not met, pages will also not display. And again, this has nothing to do with validity.

Comment by Julian Reschke

“Any website that has user-generated content…”

Unless I’m missing something, these sites will be open to cross-site scripting attacks, unless they are sanitizing the user’s input, in which case that problem should disappear.

Comment by Vlad Alexander

You wrote: “I believe that people should write to specification; I’m not suggesting otherwise.”

By advocating, “forgiving browsers and liberal validation rules” you are suggesting people should not write to specification. Your statement says to people that there is no need to write HTML, simply write anything that resembles HTML and the browser will figure out how to render it. (Just so there is no confusion, invalid HTML is not HTML, it’s something that looks like HTML).

You write: “it is folly to suggest that the vast majority of the Web shouldn’t render (or stop rendering at the first error)”

Nobody is suggesting that for existing content. The advocates for validation on the Web are talking about future content written to future specs. In other words, should future specs like HTML5 perpetuate the problems of the past or set a new course for the Web.

You wrote: “(or stop rendering at the first error) – which is how XML error handling sees it.”
FYI, in the past, there were Web browsers that did not render pages that contained invalid HTML. If it were not for the bad practices of Microsoft and Netscape, the Web today would be mostly valid HTML.

You wrote: “There is no correlation between accessibility and validation.”

It’s more accurate to say that accessibility and validation are not coupled together but there is a statistical “correlation” between them. I don’t have the numbers to back this up but from my experience, accessible sites tend to have fewer validation errors and valid sites tend to be more accessible.

You wrote: “The reason we have the Web we have today is because it is easy to publish on the Web”

I disagree. I believe the Web would simply have grown at a slightly slower pace, and who is to say that is a bad thing, since we did a lot of daft things during the hyper-growth period of the browser wars, such as inventing the FONT tag.

You wrote: “The barrier to accessibility is two-fold, in my opinion; accessibility is relatively hard to do as it requires ‘bolting on’ extra information in markup, and the society-wide problem that people do not care enough about people with disabilities.”

Bruce, you are absolutely right. And, how do we as a society currently deal with this in the “physical” world? This is done through legislation and enforcement. Yet in the “online” world of the Web, we have legislation (the HTML spec) but no enforcement (browser validation).

Comment by Bruce

Hi Shelley

you can easily have a web page that is valid XML markup, but still invalid XHTML.

…because it could be well-formed XML, but using elements that are not XHTML, right?

But not closing tags, not quoting attributes, uppercase tags, unencoded ampersands and character-encoding problems are common authoring errors; they prevent well-formedness and so would cause XML errors. Well-formedness is harder than validity, imo.

the requirements for XML are no different than the requirements for PHP, which is the server-side language that runs most of our web sites.

But far, far fewer people write PHP than HTML. PHP is programming; HTML is markup.

Comment by Rhyaniwyn

I call myself a web developer, but I’m not a very good one, so please take me with a liberal dash of salt.

I don’t argue that draconian error handling is a good thing, but I am very insistent that bad code should generate errors.

Error handling itself is so complex — I spend more time coding around possible errors than anything else. I consider it a usability function to have good error handling. I value usability.

But as a general rule, each time you handle an error you are forced to make assumptions about the code (or input). You are forced to guess what the coder or user meant by the bad code/input. There are some cases in which your educated guess will smoothly and transparently deal with “problem.” There are other times that your assumptions will, as poppa always said, make an ass out of me and u.

In some sense error handling is interpretive and as such is highly vulnerable to misinterpretation. When you are attempting to communicate with someone, misinterpretation can severely disrupt the communication process.

An analogy would be my attempting to email someone in French. I don’t know French, so I might use Google Translate on the e-mail I send and the e-mail I receive. If we’re lucky we’ll be able to muddle along, but we are certainly taking a risk that we won’t be able to understand each other, that we will misinterpret each other, and we are certainly communicating with far less clarity than we could given a common language.

On the other hand, I can remember times I’ve had a sensitive conversation with someone I didn’t know very well. The conversation was most productive and least upsetting when I took care to politely let them know I wasn’t sure what they meant and asked for clarification.

English is said to be hard to learn because its grammar is so arbitrary. Rules which apply most of the time will not apply in a few excepted circumstances. Spoken languages are organic and drift, but languages with more consistent and specific grammatical rules are accounted to be easier to learn.

Why should this not be the case with (X)HTML? To someone who really does understand coding it’s easy to remember that attributes with spaces must be quoted…it’s not just a memorized rule, it actually makes sense to us because we have some basic understanding of parsers. This “strictness is a barrier” argument seems to reject the idea that to someone who really doesn’t “get it” these lax rules are akin to a lengthy and arbitrary grammar that must be memorized by rote and called forth each time HTML is written. I conjecture that it’s easier to remember a rule that holds true in all cases than one which is circumstantial.

The XML on the web effort seemed to me to be a step in the right direction…an acknowledgment that we had been “muddling along” and we could do better. Was every “X” standard and every detail in every X standard a good idea that supported simplicity and clarity? No, but I don’t think HTML5 is wonderful in every detail either and I feel that some of its theoretical basis is highly flawed. This backlash against XML on the web, the “ivory tower” criticism…it’s creating a false dilemma between XML and HTML. It’s rejecting the entirety of the theoretical basis for the “X” movement and the entire “X” suite of standards without applying any critical regard. The fact that there are legitimate issues does not render the entire effort dysfunctional.

Regardless of my fuzzy points above, I wanted to address the specific reasons you rejected XHTML strict:

1. p in blockquote – That isn’t a fault of XML as I see it. XHTML strict had a lot of rules that I found arbitrary about document structure. Like an <a> having to be in some block-level element. It actually makes some sense from a block/inline-element perspective, but it was simply contrary to common usage. There were rules in XHTML strict that were good ideas and rules in XHTML strict that perhaps should have been approached from another angle. To me, that is not an argument against XML’s suitability for the web, it is evidence that standards processes are too remote from the “common coder” (EVEN HTML5).

2. content negotiation to serve as xml due to IE – That’s like arguing that CSS doesn’t work for the web because IE is such a bitch about getting it right. There’s tons of stuff IE doesn’t support or doesn’t support correctly.

3. user input – as posters above have pointed out, it’s a known fact that user input must be analyzed and sanitized…it would have to be done whether or not XHTML had a rule about blockquote/p.

Comment by mattur

@Vlad:

In other words, should future specs like HTML5 perpetuate the problems of the past or set a new course for the Web.

Problems of the past like, er, being wildly successful? Revolutionising the world? Creating a massive interconnected information repository that anyone can use and publish to – an unprecedented feat in the history of humanity? Those problems…?

If it were not for the bad practices of Microsoft and Netscape, the Web today would be mostly valid HTML.

TimBL: “I support Marc completely in his decision to make Mosaic work as best it can when it is given invalid HTML.”
http://1997.webhistory.org/www.lists/www-talk.1993q3/0745.html

More on the history of XML error handling:

http://diveintomark.org/archives/2004/01/16/draconianism

Comment by Shelley

Yes, I meant well formed XML.

“But not closing tags, not quoting attributes, uppercase tags, not encoding ampersands, character encoding problems are common authoring errors, and prevent well-formedness and so would cause XML errors. Well-formedness is harder than valid, imo.”

Not really, not any more. If I didn’t want to occasionally use SVG at my site, I could use a filter that would automatically correct all that you mentioned, and generate proper XHTML when I write a post.

When I did have comments, I had this filter turned on, and it did correct errors. The few times it didn’t, a note to the person who created the filter resulted in a new library that fixed the original problem.

If I don’t have comments now, it’s because of my ambivalence about comments in general, not because I’m concerned about the fact that I serve all of my pages as XHTML.

Your mention of PHP and people not needing to be as familiar with it as HTML: most people use a CMS now for their sites, and these are a mix of HTML and PHP. Because of this, rarely does the complete neophyte do much with their site templates. Those who do work with the templates are well enough informed to ensure proper X/HTML.

What we need to require is that tool makers, or those who write applications that generate page content, should ensure what they create is well formed markup, so the contents can be served either as HTML or XHTML.

Now, I do agree that how Firefox handles XML errors sucks.

Comment by Vlad Alexander

mattur wrote: “Problems of the past like, er, being wildly successful? Revolutionising the world?”

Many technologies had similar impact, such as Gutenberg’s press, the telegraph, the typewriter, audio/video cassettes, etc. After you create something wonderful, successful and revolutionary, you don’t just stop there; you work on improving it and ultimately replacing it with something even better.

mattur wrote: TimBL: “I support Marc completely in his decision to make Mosaic work as best it can when it is given invalid HTML.”

How is this relevant to the discussion about the future of the Web? In 1996 I worked for a browser vendor and when that browser did not render pages with invalid HTML, I was the loudest critic of that development decision. I now see that I was wrong and I take responsibility for my tiny role in steering the Web in the wrong direction.

Comment by Vlad Alexander

mattur, further development of that browser was discontinued and so was development of a dozen or so other browsers at that time from other companies, which did render invalid HTML.

Comment by Bruce

Thanks for an interesting comment Rhyaniwyn.

p in blockquote – That isn’t a fault of XML as I see it. XHTML strict had a lot of rules that I found arbitrary about document structure…To me, that is not an argument against XML’s suitability for the web

The vast majority of XML on the Web is XHTML. So to me, that arbitrariness is an argument against XML.

You also mention content negotiation to serve as XML due to IE, and imply that IE’s failings are behind my dislike of XML. Not at all! I work for a browser vendor whose browser supports loads of stuff that IE can’t be bothered with.

As I said, I don’t see the point of doing content negotiation because I see no advantage in serving XML to browsers that can render it.

Can anyone tell me what would be gained by serving this page as XML to Opera/ Firefox/ WebKit?

it’s a known fact that user input must be analyzed and sanitized

Of course, to remove JavaScript and prevent SQL injection. But to tidy up the HTML to conform to what you said were “arbitrary rules”, like requiring block-level elements inside blockquotes: what’s the point, except to conform to an arbitrary rule? It’s jumping through hoops for the fun of it.

Comment by Jim O'Donnell

So to me, that arbitrariness is an argument against XML.

Hi Bruce, surely that’s an argument against using HTML 4, rather than XML, since the rules for blockquote changed in HTML? By using the XHTML 1 Transitional doctype, weren’t you basically allowing comments written in HTML 3, with XML syntax? Which is fine, but seems a side issue from XML as the language of the web.

It does seem arbitrary to me that inline content is illegal, in HTML 4, for <body> and <blockquote> but allowed for <li>

Will similar issues, caused by changes in tag content models, arise with the move from HTML 4 to 5?

Comment by Rhyaniwyn

So to me, that arbitrariness is an argument against XML.

Don’t all web standards contain arbitrary ideas? Ultimately decisions must be made in a specification — even when there is no objective, logical, fact-based, bias-agnostic “best” choice.

Jim makes the point I intend to. There are also examples of seeming arbitrariness from HTML5 — footer was just changed because of a needlessly limited content rule. The cite tag’s content rule seems quite arbitrary to me. I recall having !!! moments when reading about the time tag — and a few other things I’d need to re-read to recall.

But I don’t want to make this an XHTML vs. HTML5 issue; it’s not. It’s just an issue of…well, holding ALL standards to the same, um, standard. I don’t agree with much of the preface to HTML5 and there are specific things about it I dislike…but aside from those moments when I am indulging in hyperbole, you won’t hear me say, “HTML5 is horrible and is going to hold the web back for another decade.”

Because it’s not really true and I find it difficult to hold vociferous and extreme positions on the “future of the web”. I am forced to acknowledge that there were, and are, pros and cons to every approach. We can insist that the “looseness” of HTML made it easy to use and that ease of use was what made the web take off like it did, but there is absolutely no evidence to back that up, it is merely a plausible theory. I could say the interpretive approach browsers took to rendering and the competition of the browser wars was what made web design both easy AND exciting and ease/excitement is why the web took off like it did. But I don’t think anyone would say “Yes, that’s why we should advocate for browser-divergent interpretation of HTML/CSS and browser-specific extensions, that makes web design AWESOME! Standards are stuffy.”

I can’t say and don’t mean to say, “Oh, XML is GREAT for websites.” It ignores the reality of today and the implications of the last few years. Real-XML XHTML didn’t take off in any significant way. And none of the most exciting things I liked about the potential XML web took off either. It seems to “prove” that regardless of the “awesome potential”, it “just didn’t work.”

But because there were “future scenarios” about XML on the web that I found (and still find) exciting I’m not willing to dismiss XML with an indifferent wave of my hand. I fear that such a dismissal will become a blanket rejection of even the good ideas. Which is unnecessary, because many of those good ideas don’t actually depend on XML; they are just associated with it and XML provided a framework for them.

As it stands there are no real advantages to serving as XML. I’d be indulging an insane pro-XML bias to insist there are; I try to reserve lapses in sanity for my personal life.

Comment by Bruce

@Jim

Hi Bruce, surely that’s an argument against using HTML 4, rather than XML

Yes, you’re right. Brain fart on my part.

@Rhyaniwyn: I agree. I’ve written before about the time weirdnesses etc. XML is a good idea, in my opinion. But the reality (even ignoring IE) is that, as you say, “there are no real advantages to serving as XML” and, because of the draconian error handling, it’s all risk and no gain.

Comment by John Foliot

…because of the draconian error handling…

But Bruce, in today’s day and age, draconian is not all bad, and in fact, as I mentioned elsewhere, all other web ‘languages’ require correct syntax and implementation today, or else things simply don’t work (Draconian Fail).
As a matter of fact, none other than Mark Pilgrim himself wrote last week “…IE and Safari both have a mode where they essentially say “I detected this page as a feed and tried to parse it, but I failed, so now I’m giving up, and here’s an error message describing why I gave up.”” (http://blog.whatwg.org/2009/09) That sounds like a draconian fail to me!

Why then should we feel that somehow tag soup HTML should get a pass? XML (or rather, XML rules in XHTML) provides a rigor and framework that also serves to teach web developers today the importance of well-formedness, and I venture to guess that had we had this ‘requirement’ from HTML 2, the web would likely not be that different today from what we have, with one exception: all of the content would be well formed and valid.

We’ve all come a long way since HTML 2, and we’ve learned a lot. *I* have faith that professional web developers today would have no problem working within those rules to gain the benefits of HTML5, and surely we can develop WYSIWYG tools that can produce and respect well-formed text strings – XStandard does that today. Thus we now have an opportune time, with the release of a very different HTML than we had in those days, to change the rules and move forward with standards, including well-formedness, and the XML rules there are pretty simple.

Comment by Bruce

Hi John

thanks for the comment. You said

Why then should we feel that somehow tag soup HTML should get a pass?

What is “tag soup”? I like the fact that HTML5 does not care if I say <meta charset=utf-8>, <meta charset="utf-8">, <meta charset=utf-8 /> or <META charset=utf-8>

That’s not tag soup. That’s just more “relaxed” validation. I’m not arguing against validation, only that things that don’t matter shouldn’t be invalid, and the browsers shouldn’t fail to render.

XML (or rather, XML rules in XHTML) provides a rigor and framework that serves to also teach web developers today

I agree that the rigor of rules and validation teaches web developers. It’s also an invaluable Quality Assurance mechanism. I always advise validating code. I just see no point in a validator rejecting <img src="foo.jpg"> because there is no trailing slash, when it doesn’t make any difference to the DOM.

we now have an opportune time, with the release of a very different HTML than we had in those days, to change the rules and move forward with standards, including well-formedness, and there XML rules are pretty simple

Well-formedness is a non-issue with trailing slashes, not closing some tags, quoting attributes etc. It just doesn’t matter.

What does matter in today’s browsers is that incorrectly nested tags produce different DOMs. That is an impossible basis upon which to build robust JavaScript to power Web Apps.
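The textbook case is mis-nested formatting elements; before HTML5 pinned the parsing algorithm down, engines repaired markup like this in different ways and handed scripts different trees for the same source:

```html
<!-- The <b> is closed before the <i> it contains. Engines historically
     repaired this mis-nesting differently, producing different DOMs. -->
<p><b>bold <i>bold italic</b> just italic?</i></p>
```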

There are four rendering engines, and millions of developers. So to me, defining a parser that guarantees a consistent DOM and makes the browser do the work is better than stopping rendering at first failure, or requiring web developers to learn weird rules.

(I can never remember which entity names I can use in XML and which I can’t, which tags require closing and which don’t.)

Comment by karl

Are you talking about XML on the Web or, as it seems in your post, about XHTML (XML in a front-end context)?

The context in which XML is used greatly affects how well it works. There are definitely issues, but “XML on the Web” is so universal a framing that it fails to address the real issue.

Also Geoffrey is right, the XML specification does not say “stop displaying the document”.

See http://lists.w3.org/Archives/Public/www-archive/2007Dec/0102

Comment by Richard Cunningham

Does anyone else consider the idea that HTML is just written to be out of date now?

With Microformats, HTML in RSS, XFN, FOAF etc. there are many reasons that non-web-browsers need to understand HTML.

For example, I have some code which reads an HTML page in order to get the RSS feed from within it. I have to use HTMLtidy and then an XML parser to do this. RSS feeds themselves also have HTML in them, and I have to use HTMLtidy to sanitise that too (remove JavaScript etc.).

I disagree on relaxed validating of XML: if the author of a page wants to miss off trailing slashes and quotes then they should use HTML, not XHTML. Trailing slashes do matter, because unless you have a whitelist of short tags you don’t know whether they missed the closing tag or not. And if you have a whitelist of short tags, what happens with new short tags in old browsers/parsers?

Comment by Rhyaniwyn

Using XML on the web and writing valid XHTML (then serving it as XML) *are* different — “XML on the web” is far more vast. But they are closely related in my head, which is why I’ve been speaking of them almost as if they were the same.

One of the reasons I favored XHTML (one of those things that didn’t *really* come to fruition) was this idea that XML formats would be used all over the web to create this vast, flexible, semantic network. Then, because of XML’s clear and consistent syntax rules, all the content on this network could be parsed to create incredibly diverse, dynamic, rich, customizable web-anythings.

As I understood it, that was one of the reasons XHTML was pushed: this idea that if “front end” websites, which dominate the web, weren’t written as syntactically correct XML, it would severely limit the potential of XML on the web.

It’s not like you can’t parse HTML and we have a lot of tools to make it easier (because it *is* potentially so inconsistent). But the tool I’m writing right now that relies on XML as input/output is much easier to write and is far more reliable than the one I wrote just prior that had to scrape HTML. As a result my current tool is able to do more and has had more work done on the interface.

That’s neither here nor there, really; the point is that XHTML websites were an important facet of the ‘XML roadmap’, as I saw it anyway. I’ve only recently begun addressing the more technical web stuff; I consider myself primarily a designer, or possibly a front-end developer.

But perhaps I should have been suspicious of the idea that XML would ‘save and improve the web’. I recall reading an article when Silverlight came out about how _someone_ is always pushing a new technology or platform that is going to be “the web, only way better.” Whoever wrote it made a point along the lines of, the web became what it is with HTML etc., do we really need a “new and improved web”? If we jump on board with something that’s “like the web, only better” is it really going to be all the things that made the web great? (…Assuming we even know for *sure* what those are…)

I agreed with that sentiment when applied to Silverlight et al. But it never really occurred to me to have the same antipathy for XML.

Perhaps I’m veering off the subject, but ultimately I see this conversation as being one about where the web will and/or should go in the future. No one likes change, but there is such a thing as progress…and progress is desirable. None of us are Cassandra (and if we were, no one would believe us or appreciate it). We don’t know for sure if our theory about the best direction is going to capture the imagination of the masses like the “original web” did or if it’s truly going to be adequate for the things we’ll want to do over the next 5 or 10 years.

XML seemed to me to provide a good framework for progress and expansion because, ultimately, expressed as XHTML, it *wasn’t that different a way to write markup.* No one was asking us front-end people to buy proprietary software and try to write ActionScript. Taken one step at a time it seemed sensible and easy. And it also seemed technically sound, to whatever extent I understand that concept.

Sure, there were problems, but why not address those rather than trash the whole thing? Not that it is, exactly. We use XML, more or less, for a lot of things on the web. But the response to HTML5…

*shrug*

I don’t know what I’m talking about probably, or if any of that made any sense, I have a tendency to ramble. And I know I equated and glossed over a lot of concepts that are complicated (probably far beyond my understanding). But hopefully I got my point across… :-)

Comment by David Hucklesby

XML parsing by browsers doesn’t have to be “draconian.” FWIW Opera offers to continue parsing a page as HTML when it finds an error…

Comment by John Foliot

@David Hucklesby: I wonder how much smaller Opera Mini could be if it didn’t have to account for sloppy error correction. The more consistent we are in feeding user agents, the easier it is for them to run, and at a smaller foot print/code base. Just sayin’…

@Rhyaniwyn: Nah, you got it right

Comment by bruce

@Rhyaniwyn – everything you said makes perfect sense to me.

On Twitter, Mr Jim O’Donnell (@pekingspring) said:

XML folk say: improve authoring tools, hide the complexity from authors, sanitise and tidy code with software…HTML folk say: Let authors author code by hand, relax validation to deal with varying quality of code they produce.

I think Jim characterises the situation correctly.

Above, Richard Cunningham says

With Microformats, HTML in RSS, XFN, FOAF etc. there are many reasons that non-web-browsers need to understand HTML.

For example, I have some code which reads an HTML page in order to get the RSS feed from within it. I have to use HTMLtidy and then an XML parser to do this. RSS feeds themselves also have HTML in them, and I have to use HTMLtidy to sanitise that too (remove JavaScript etc.).

So, Richard, it sounds like you’ve solved the problem. You have no requirement for well-formed XHTML from the millions of authors of the code you’re parsing, as you deal with it. One person – a professional web developer: you – did the work instead of expecting all the world to use a weird set of rules about character entities and trailing slashes.

And that’s how it should be.

John Foliot says on his blog post Are We Still Arguing About Validation?:

As for ‘professional’ web developers – are you really PROUD that you can skate by generating sub-optimal work and can get away with it? How would you feel if your PROFESSIONAL car mechanic did sub-optimal work? Or your PROFESSIONAL plumber? Or your PROFESSIONAL payroll clerk

I completely agree.

But most of the web isn’t written by professionals. I don’t want Joe Q Public to have to buy some expensive authoring tool.

I want him or her to be able to view source, cut and paste, and publish.

I want the blogger in Iran or the student in Peru or the schoolkids in Thailand to be able to tell me their stories without worrying about esoterica like trailing slashes.

Comment by mattur

I want the blogger in Iran or the student in Peru or the schoolkids in Thailand to be able to tell me their stories without worrying about esoterica like trailing slashes.

Absolutely. Publishing power to (all) the people!

The problem is a whole generation of web people has been conditioned to believe that consistent parsing/web standards/extensibility/the Semantic Web etc *require* stricter rules and well-formedness. It’s going to take a while for people to unlearn that.

Comment by John Foliot

Hang on a second…

Bruce notes that Richard Cunningham,

…(has) no requirement for well-formed XHTML from the millions of authors of the code you’re parsing, as you deal with it.

Right, he doesn’t need it from the end authors, but he DOES need it to be ‘valid’ to do what he needs and wants to do with that content post authoring. So he has scripts and other server-side solutions that convert that dirty ‘coal’ to clean ‘diamonds’. WYSIWYG editors like XStandard make it almost impossible to author non-conformant code, and I truly question how many non-professional authors today hand-roll their content with a text editor. Further, if they are learning via ‘view source, copy and paste’, then where is the harm in teaching them to use valid code vs. non-valid code?

I’m not arguing for draconian error handling with legacy content, but moving forward with HTML5 we have an opportunity to improve on what we have, with little-to-no additional cost to that multitude of content authors who simply want to write and publish – heck, those content authors don’t hand-roll anyway; they use online tools like Blogger or My Opera, contribute via MySpace/Facebook/Orkut, and engage and share at sites like Wikipedia or Flickr…

The past is the past: let’s learn, let’s strive for better, let’s aspire to improve rather than rest on the status quo. Asking for valid HTML will not set back the web, nor will it stop “…the blogger in Iran or the student in Peru or the schoolkids in Thailand…” from adding knowledge, experiences or perspective to our collective society. We already have the tools, scripts and server-side applications that deliver the possibility of serving valid code to the browser, so let’s just agree that this is a better way forward. It’s about the technology, not the source of the content.

Comment by Richard Cunningham

Actually, what I do with the code that reads web pages and HTML in RSS is try to read it as XML and, if that fails, run HTMLtidy. HTMLtidy is quite complex and about doubles the memory footprint of my code when I load it.
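In outline, that fallback looks something like this (a simplified sketch, assuming PHP with the dom and tidy extensions):

```php
<?php
// Simplified sketch of the "try XML first, repair with tidy if that fails"
// approach for reading pages and feeds of unknown cleanliness.
function load_markup($source) {
    libxml_use_internal_errors(true);   // don't spray warnings on bad input

    $doc = new DOMDocument();
    if ($doc->loadXML($source)) {       // already well-formed: cheap path
        return $doc;
    }

    // Not well-formed, so repair it with HTML Tidy and parse the result.
    // numeric-entities avoids named entities the XML parser wouldn't know.
    $clean = tidy_repair_string($source,
        array('output-xhtml' => true, 'numeric-entities' => true), 'utf8');

    $doc = new DOMDocument();
    $doc->loadXML($clean);
    return $doc;
}
?>
```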

TBH, I don’t have much hope that XHTML pages will become strictly validating, but creating browsers that let people get away with it is a step in the wrong direction.

Nothing stops “the blogger in Iran or the student in Peru or the schoolkids in Thailand” from going to blogger.com or emailing to posterous.com; bloggers shouldn’t have to know HTML. (X)HTML is for developers now, people who know what they are doing, and they have created tools for everyone else. Even when I wrote my first web page 12 years ago there were GUI web page editors.

I believe the lax standards in HTML over the years are a big part of the reason we have so few web browsers now. Every time a page author makes a mistake, the browser has to work out what they really meant; and what happens when different browsers have a different idea of what the author intended?

Comment by Isofarro

John Foliot says: “all other web ‘languages’ require correct syntax and implementation today else things simply don’t work (Draconian Fail).”

By “all other web languages”, John presumably means CSS and JavaScript, both optional layers in the layered-cake approach of modern web development. HTML is the only layer that deals with actual content, and as such the pain of draconian error handling is that much higher.

In the hands of professional web developers validation is just a means to an end, and not the end itself. We’ve matured enough to realise when validation is harmful, and to accept the imperfection in the name of delivering on time, to spec, and largely accessible.

I see benefits in validation, but not enforced validation. Most of those benefits I see because of well-formedness, but that is a digression.

The most interesting content on the web isn’t at the head of the power-law distribution, it’s in the long long tail.

The most interesting conversations on the web happen in very small groups in sites you’ve never heard of, and never will. These people do just enough to be able to communicate; they don’t know about validity or well-formedness. They don’t know how to install blogs, find valid-markup-generating plugins or extensions, or report filtering bugs. Nor should they need to.

The low barrier to entry on the Web is a feature, not a bug. Content that was good enough got rendered to visitors. The Web is a Good Enough environment. It doesn’t have to be perfect to be usable. And perfect doesn’t offer much additional value over Good Enough. What makes content good isn’t its validation score, but how useful it is for the intended recipients. Validation won’t make any difference to the quality of conversations.

We have centuries of experience of dealing with ambiguous messaging – yes it’s still not perfect, but we manage just fine. Sometimes it injects humour into our lives. I can’t see why the Web would benefit by removing this trace of what makes us human.

And Bruce is right. The Web isn’t just about web professionals. The point that ‘most websites use a CMS’ is trite and meaningless because it ignores the nature of the long tail.

Comment by mattur

John Foliot:

…so let’s just agree that this is a better way forward

!!!

The benefits of this “better” way forward are apparently so numerous no one can actually think of any.

Comment by John Foliot

mattur:

The benefits of this “better” way forward are apparently so numerous no one can actually think of any.

From this thread alone, here are six:

“…and if that fails run HTMLtidy. HTMLtidy is quite complex and about doubles the memory footprint of my code when I load it..” (Richard Cunningham)

“…Every time a page author makes a mistake, the browser has to work out what they really meant and what happens when different browsers have a different idea of what the author intended?…” (Richard Cunningham)

“…I wonder how much smaller Opera Mini could be if it didn’t have to account for sloppy error correction. The more consistent we are in feeding user agents, the easier it is for them to run, and at a smaller foot print/code base…” (John Foliot)

“…One of the reasons I favored XHTML (one of those things that didn’t *really* come to fruition) was this idea that XML formats would be used all over the web to create this vast, flexible, semantic network. Then, because of XML’s clear and consistent syntax rules, all the content on this network could be parsed to create incredibly diverse, dynamic, rich, customizable web-anythings…” (Rhyaniwyn)

“…I agree that the rigor of rules and validation teaches web developers. It’s also an invaluable Quality Assurance mechanism…” (that’s 2 from Bruce Lawson)

Comment by mattur

“…and if that fails run HTMLtidy. HTMLtidy is quite complex and about doubles the memory footprint of my code when I load it..” (Richard Cunningham)

“…I wonder how much smaller Opera Mini could be if it didn’t have to account for sloppy error correction. The more consistent we are in feeding user agents, the easier it is for them to run, and at a smaller foot print/code base…” (John Foliot)

“…Every time a page author makes a mistake, the browser has to work out what they really meant and what happens when different browsers have a different idea of what the author intended?…” (Richard Cunningham)

These all make the same “won’t someone please think of the parsers…?” argument from 2003. It hasn’t improved with age. Opera Mini works fine today. Current HTML parsing methods work fine today. HTML5 defines the error recovery rules that you, John, rely on to publish your XHTML pages as text/html, today.

Computers get faster, more powerful and marginally less-stupid every year, while humans… don’t.

“…One of the reasons I favored XHTML (one of those things that didn’t *really* come to fruition) was this idea that XML formats would be used all over the web to create this vast, flexible, semantic network. Then, because of XML’s clear and consistent syntax rules, all the content on this network could be parsed to create incredibly diverse, dynamic, rich, customizable web-anythings…” (Rhyaniwyn)

Vast, flexible, diverse, dynamic, rich, customizable, motherhood, applepie, semantic networks do not *necessarily* require draconian error handling – unless they’re based on XML. Understandably, people (wrongly) think the future *has to be* harder and XML-based, because that’s what the W3C godchannel has been telling them for a decade.

“…I agree that the rigor of rules and validation teaches web developers. It’s also an invaluable Quality Assurance mechanism…”

Again, no one is denying that validation can be useful as a quality assurance technique for professionals. Professionals should always produce valid markup. Unless it’s more useful or convenient not to, obviously.

Google Wave

Google Wave uses HTML for UI. XML is great(-ish) for machine-machine data exchange. Web pages by/for humans, not so much.

Comment by Rhyaniwyn

@mattur
I don’t think I said “I really like draconian error handling.”

I said that the emphasis on validation and consistency within a logical framework, such as XML provides, allows different applications, written by different developers, of differing skill, from all over the web and world, to work with each other and with web content more easily: predictably. XML also provided exciting potential to extend HTML.

The time and resource sink caused by having to interpret ambiguous input is a severe detraction that can be lessened by “strict” — read “consistent” — syntax rules. I see the attraction of failure for that reason, but do not consider draconian error handling a positive thing for “usability” in any sense.

I never meant to imply that XML is the only way to accomplish those goals of extensibility, semantic richness, etc. I said simply that XML, aside from its shortcomings (which any specification has), provided a consistent framework to accomplish those things.

My only argument regarding draconian error handling in XHTML is a reminder that error handling is not the entirety of the specification. Since it is instadeath, I can sympathize with “all risk, no gain.” But I feel XML compatible HTML had good ideas behind it; ideas I don’t want to see forgotten in the HTML5 furor.

It remains to be seen whether other specifications will offer what XML seemed to offer. Seemed certainly being a key word, yes.

HTML syntax kind of reminds me of the dress code at work. I read the whole thing once and thought, “Dammit, why don’t they just provide us uniforms?” It was 4 pages of, “Women may wear capri pants in the summer so long as they fall x inches below and x inches above the knee. Mens’ shorts that fall between the ankle and knee are prohibited. No shorts or skirts may be worn higher than 1 inch above the knee.” My MOM wasn’t that strict and she was pretty strict. I can’t figure out what kind of shorts men can wear, either, or if they just can’t wear any.

Comment by George Katsanos

I’m with Vlad Alexander on this one. Well said.
We “all” (I hope) agree that we don’t like the HTML tag soup that’s all over the web, so what’s the sudden 180-degree change about XML being “too strict”? I thought supporting validity and standards was what you do for a living, Bruce!
Or maybe I just didn’t get the purpose of this post. (what was it anyway?)

Comment by Bruce

@George ” I thought supporting validity and standards was what you’re doing for a living Bruce!”

It is. I passionately advocate using open web standards, and ensuring that they are used according to the rules of the language.

That’s what professionals do.

I support the HTML5 drive to allow, as valid, different authoring conventions (trailing slashes or not, uppercase tags or not, quoted attributes or not) where the differences don’t matter in the real world.

I support the browsers that render the Web forgiving bad markup so that we can read ancient pages, pages authored by non-professionals or assembled by crappy dinosaur CMSs. I don’t support draconian error handling.

Take the font element; it’s wrong to author with it. Professionals know that. But browsers must render it.

Comment by Michael Kozakewich

I was taught in college how to create my own DTD and validate by it, so I feel lucky.

I think WYSIWYG editors have a responsibility to the internets to create a tool that manufactures well-formed code. This goes beyond standards or draconics — it’s moral responsibility, when you’re creating something for someone to use in different places.

At that point, the majority of people would be writing what would end up being well-formed code. I don’t think anyone (except me) nowadays writes with actual text.
Heck, one could create a Google Document and export it to html.

Essentially, the landscape of the web is changing, and we’re climbing up the stairs onto a thin layer of abstraction, where people don’t really need to see the code. We used to get “PASTE THIS INTO YOUR BLOG!!” quiz results; now we have an icon we press.
At this stage in time, well-formedness would be appreciated.

Comment by Dan

To me XML doesn’t belong on the web. That’s why XSL was invented – so you can transform XML into HTML or whatever, if you really must use XML to begin with.