Archive for the 'microformats' Category

On the talismanic fight between RDFa and microdata

A new fight has broken out in specland, between the supporters of RDFa and supporters of microdata. Observers may be wondering why; both are methods of adding extra markup to existing content in order that machines may better understand the content. Semantic Web proponents (note capital letters) dream of a Web where all content is linked by said machines. Semantic Web sceptics have more humble aspirations of search engines better understanding micro-content (is this string of digits a book ISBN, or a phone number?).

RDFa was part of XHTML 2. It became a W3C standard (or, in their vernacular, a “recommendation”) in 2008. microdata was invented by Ian Hickson as part of HTML5 because he identified deficiencies in RDFa. microdata was subsequently modularised out of W3C HTML5, but microdata is part of HTML5; it validates, whereas RDFa doesn’t.

Note the history. Like football fights that break out because one guy called an opposing team fan’s pint “a pouff”, this isn’t about the actual slight at all; this is about the past, allegiances and alliances; it’s a clash of world views. This is XML versus non-XML; it’s the XHTML 2 gang against the uncouth young turks of HTML5. This is Rangers vs Celtic; it’s Blur vs Oasis; it’s Tiswas vs Swap Shop.

(Added 15:30 GMT: Re my framing the current debate as a talismanic battle, I should point out that I don’t mean Manu (whom I’ve always found to be courteous, thoughtful and a jolly good chap). Neither do I mean Marcos, who isn’t a WHATWG-er. But some of the discourse (“cowardice”, “suck metadata and fade, for all I care” on one side, and “TimBL’s RDF temple priests still mad as hell” on the other) suggests some, er, partisan feeling going on.)

What follows is the observation of a layman; I’ve not used much structured content, so am not an expert. (I once tried to use microformats for events at the Law Society, but their accessibility problems prevented it.)

In my opinion, the primary deficiency of Classic RDFa is that it’s too hard to write. For professional metadata-ologists it may be simple (but, hey, those guys understand Dublin Core!). The difficulty for me as an HTML wrangler was namespacing, CURIEs, and triples. This is XML land, and most web authors are not particularly adept with XML.

There’s also the problem that in order to use RDFa properly, you needed an xmlns attribute which is separate from the content you’re actually marking up (you don’t anymore in RDFa 1.1, see Manu’s comment). In a world where lots of content is syndicated via machine, or copied and pasted by authors (many of whom don’t really understand what they’re copying and pasting), this leads to breakage as not all of the necessary moving parts get transferred to their new environment. Hixie wrote:

Copy-and-paste of the source becomes very brittle when two separate parts of a document are needed to make sense of the content. Copy-and-paste is how the Web evolved, so I think it is important to keep it functional and easy.

microdata solves this problem. It’s also easier to write than Classic RDFa (in my opinion) although I’m still mystified by the itemid attribute. I intend to start using microdata on this site soon (in order to plug the holes left by removal of the HTML5 pubdate attribute).

I’ve been recommending that people use microdata. Its main advantages:

Manu Sporny understood the problem that RDFa is hard to author for those of us who find the best ontology is a don’t-ology. Almost a year ago, he set about simplifying RDFa and came up with RDFa Lite. RDFa Lite greatly simplifies RDFa; in fact, you can search and replace microdata terms with RDFa terms (see his post Mythical Differences: RDFa Lite vs. Microdata).
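
A rough sketch of that search-and-replace, with one item written both ways. The markup is illustrative only: it assumes the schema.org Book vocabulary and is not taken from Manu’s post. The microdata attributes itemscope/itemtype and itemprop become vocab/typeof and property in RDFa Lite:

```html
<!-- microdata -->
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">Self-portraits</span>
  (ISBN <span itemprop="isbn">0867193719</span>)
</div>

<!-- RDFa Lite 1.1: same structure, different attribute names -->
<div vocab="http://schema.org/" typeof="Book">
  <span property="name">Self-portraits</span>
  (ISBN <span property="isbn">0867193719</span>)
</div>
```

Neither version needs anything declared elsewhere in the document, so either block should survive being copied and pasted on its own.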

RDFa has multiple advantages, too:

It seems to me that developers should just choose the one that meets their project’s needs. Need valid code, don’t need “full fat” RDFa, and need a JavaScript API? Choose microdata. Care about Facebook and don’t care about a JavaScript API? Use RDFa Lite.

The current fight, however, won’t allow that. The RDFa gang want to stop microdata going further in the standardisation process because RDFa became a Recommendation first, and microdata is quite similar to it. (This is a controversial perspective; see Manu’s comment.)

While I completely understand that having two competing standards makes life harder for developers in the short term, I agree with Marcos Caceres (who isn’t a WHATWG/HTML5 zealot), who counters Manu Sporny’s objection to microdata progressing thus:

I don’t see what it being a “Recommendation” has to do with anything – just because it’s a W3C Recommendation does not mean that RDFa has a monopoly on structured data in HTML. So, just because that spec reached Rec first doesn’t mean that it’s somehow better or preferable to any other future solution (including micro data). That would be like objecting to Javascript because assembler (or punch cards) already meet all the use cases…

I hope you will instead focus your energy on convincing the world that RDFa is the “correct technology” on its own merits and not place your bets on a mostly meaningless label (“Recommendation”) given by some (much loved, but) random standard organisation.

I see no technical reason to favour microdata or RDFa Lite; both do the job. So, developers; which tickles your fancies? RDFa Lite or microdata?

microdata help, please

I’m trying to wrap my little bonce around HTML5 microdata, not least because Opera 12 pre-alpha has support for it.

I’m quite discouraged because the two articles I’ve read tell me that it’s easy, but I’m still stuck (although they are by noted brain-boxes Oli Studholme and Tab Atkins).

I’m befuddled over itemid:

Sometimes, an item gives information about a topic that has a global identifier. For example, books can be identified by their ISBN number.

Vocabularies (as identified by the itemtype attribute) can be designed such that items get associated with their global identifier in an unambiguous way by expressing the global identifiers as URLs given in an itemid attribute.

The exact meaning of the URLs given in itemid attributes depends on the vocabulary used.

What actually does this mean? How do I know if a particular vocabulary supports global identifiers for items?

In the spec, some Microdata vocabularies are listed. vCard is one, and the spec says “This vocabulary supports global identifiers for items.” The URL defining the itemtype for vCard doesn’t seem to tell me, and the examples in the spec make no use of itemid.

And, because I understand real examples rather than the theoretical, what’s the practical benefit of itemid?

Specifically, what would I gain by using


<div itemscope itemtype="http://vocab.example.net/book" itemid="urn:isbn:0867193719">
Rebekah Brooks' Self-portraits (ISBN 0867193719)
</div>

[example simplified from an example in the spec which says “The “http://vocab.example.net/book” vocabulary in this example would define that the itemid attribute takes a urn: URL pointing to the ISBN of the book.”]

over Schema.org’s Book schema (which doesn’t seem to use itemid – in fact, schema.org seems to make no mention of it):


<div itemscope itemtype="http://schema.org/Book">
Rebekah Brooks' Self-portraits (ISBN <span itemprop="isbn">0867193719</span>)
</div>

Double points if you can answer the question without baffling me with mentions of SPARQLy OWLs and Don’tologies.

rel=accessibility

While I was on my holidays there was a storm(ette) about rev=canonical and how it isn’t possible in HTML 5 because rev isn’t in the spec. (Apparently, the answer is to use rel=shortlink instead).

Mark Pilgrim published an article about link relations in HTML 5 with more information about the rel attribute, which I found interesting; I had no idea that relations such as rel=license and rel=author were available to allow auto-discovery of license information, and author details.

So I want to float the idea of rel=accessibility that would allow assistive technologies to discover and offer shortcuts to accessibility information, such as a WCAG 2 conformance claim, or a form to request content in alternate formats (for example).

The reason this would be useful is that links to such pages are generally right down in the footer of the web pages. This means that screenreader users have to navigate to the end of the page to find the link, or never know it exists.

Ironically, on sites that really do need a link to accessibility help (because of lack of structure to navigate with or huge amounts of content before the footer), those who need it are unlikely to find the link to the help.

In the “bad old days”, helpful developers would give that link an accesskey attribute (but accesskeys are generally undiscoverable by a human or a parser, and often conflict with assistive technologies’ command keystrokes).

A standardised way of indicating the related accessibility information would be better and not rely on arbitrary keys chosen by a developer.
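
A sketch of how that might look in markup. Bear in mind that rel=accessibility is only the value floated in this post and isn’t in any spec or registry, and that the /accessibility/ URL is made up:

```html
<!-- Hypothetical link type: "accessibility" is proposed here, not registered anywhere -->

<!-- In the head, where a user agent or assistive technology could discover it up front -->
<link rel="accessibility" href="/accessibility/" title="Accessibility help">

<!-- Or on the existing footer link, wherever it sits in the page -->
<a rel="accessibility" href="/accessibility/">Accessibility help</a>
```

An assistive technology that recognised the value could then offer a shortcut to the linked page without the user having to reach the footer, and without relying on a developer-chosen accesskey.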

So, should I propose that rel=accessibility be added to the list of values? It looks to be an arduous process; although you don’t need to prove your worth to the HTML 5 gatekeepers, you do have to prove your worth to the microformats gatekeepers.

I thought I’d ask you guys first—is this a good idea?

Microformats, accessibility, HTML 5 (again)

Microformats are a good idea, but some have accessibility problems because they expose machine data to humans by misusing the abbr element. These problems led to the BBC removing those microformats from their sites.

One such misuse is encoding dates and times in microformats such as hCalendar, hAtom, and hReview. Ultimately, this problem goes away in HTML 5, as that introduces a time element which is obviously better than an abbreviation for marking up dates and times (a tenet of microformats is to “use the most accurately precise semantic XHTML building block for each object”).

So, an example of an HTML 4 based hCalendar microformat (taken from the spec) is

<div class="vevent">
<a class="url" href="http://www.web2con.com/">http://www.web2con.com/</a>
<span class="summary">Web 2.0 Conference</span>:
<abbr class="dtstart" title="2007-10-05">October 5</abbr>-
<abbr class="dtend" title="2007-10-20">19</abbr>,
at the <span class="location">Argent Hotel, San Francisco, CA</span>
</div>

After replacing the abbr element with time and replacing its title attribute with datetime we get

<div class="vevent">
<a class="url" href="http://www.web2con.com/">http://www.web2con.com/</a>
<span class="summary">Web 2.0 Conference</span>:
<time class="dtstart" datetime="2007-10-05">October 5</time>-
<time class="dtend" datetime="2007-10-20">19</time>,
at the <span class="location">Argent Hotel, San Francisco, CA</span>
</div>

You can test how it renders in your fave browsers (and the other ones) on the microformats with time test page.

Replacing the abbr pattern elsewhere

Of course, this only works for dates and times. Other microformats use the flawed abbr pattern to code locations in microformats such as hCard, hCalendar & ‘geo’.

Here’s an example

Let’s go to <abbr class="geo" title="30.300474;-97.747247">Austin, TX</abbr>

Which, to a screenreader set to expand abbreviations, is an incomprehensible string of numbers (mp3):

“Thirty point three oh oh four seven four semi-colon minus ninety-seven point seven four seven two four seven”

A leading microformatter, Ben Ward, recently proposed an extension to the value-excerption pattern (no, I don’t know what that means, either) which allows this machine-readable information to remain in the DOM while hiding machine data from people.
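
For the Austin example above, the markup might look something like this. It is a sketch only: it assumes the value-title class name used in the microformats wiki’s write-up of the value class pattern, and the proposal as tested here may differ in its details.

```html
<!-- Sketch, assuming the "value-title" class name -->
Let’s go to
<span class="geo">
  <!-- machine data lives in the title of an otherwise empty first child -->
  <span class="value-title" title="30.300474;-97.747247"> </span>
  Austin, TX
</span>
```

The human-readable text stays outside the title attribute, and nothing is marked up as an abbreviation, so there is nothing for a screenreader to expand.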

My test page plays nicely with the following screenreaders:

  • JAWS 9 and JAWS 10 on Firefox 3 and IE 7 with high verbosity settings and abbr and acronym set to always be expanded (says Jared Smith)
  • NVDA on Firefox 3
  • Opera 9.63 and VoiceOver 10.5.5 (thanks Henny)
  • Window-Eyes with IE 7 on WinXP (thanks dotjay)
  • WinXP Window-Eyes with FF3 and full punc reads human date. It pauses before the .dtstart spans in Mouse mode, but not in Browse mode (thanks dotjay)
  • Safari 3.2.1 on OS X Leopard with VoiceOver (thanks dotjay)
  • WinXP Window-Eyes with IE 6 and full punc reads human date (thanks dotjay)

I’m really excited by this as it may be the end of the microformats versus accessibility debates (that I’ve helped stir up). If you have access to assistive technologies, please give it a test.

Problems with this pattern from a microformats perspective might be

  • The machine data is “hidden” so might more easily fall out of sync with the human data
  • The machine data must be the first child of the property. If it isn’t, a parser won’t see it – but it will be trickier to debug because the developer will still see it in the source code

I hope, if screenreader and parser testing allows, that the new pattern will be adopted so that those of us who want to use microformats and care about accessibility can use it.

The future of microformats in HTML 5

I had naively thought that many microformats would use the HTML data-* attributes, which are for “embedding custom non-visible data” and thus seem perfect for embedding such information on structural markup.

Any element in HTML 5 can have any number of data-* attributes. The asterisk is a wildcard; you can call them what you want. An example from the spec:

<div class="spaceship" data-id="92432"
data-weapons="laser 2" data-shields="50%"
data-x="30" data-y="10" data-z="90">

However, the spec goes on to say

User agents must not derive any implementation behavior from these attributes or values. Specifications intended for user agents must not define these attributes to have any meaningful values.

I was uncertain what this meant, so asked Anne van Kesteren who told me that these attributes are for passing data to scripts that are private to the page, rather than to indicate meaning to external parsers:

It’s so that non-private extensions that need User Agent implementation a) won’t break sites using those names for other purposes and b) get due consideration by the Working Group and a proper name without data-

This is made explicit in an addition to the spec last night by the editor, Ian Hickson:

This is because these attributes are intended for use by the site’s own scripts, and are not a generic extension mechanism for publicly-usable metadata.
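
In practice that means something like the following, where a script belonging to the page reads back values the page itself wrote. This is a minimal sketch reusing the spaceship attributes from the spec example above; the script itself is an assumption for illustration.

```html
<div class="spaceship" data-id="92432" data-shields="50%"></div>

<script>
  // The page's own script reads the page's own data-* attributes.
  var ship = document.querySelector('.spaceship');
  var id = ship.getAttribute('data-id');            // "92432"
  var shields = ship.getAttribute('data-shields');  // "50%"
  // An external parser, by contrast, has no agreed meaning for data-id or
  // data-shields; the names are private to this page.
</script>
```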

So, microformats won’t be “rolled up” into HTML 5. I imagine that some of the microformats community will wish to lobby the HTML 5 working group for a proper name for the data they wish to store with an element, so that the data can be parsed reliably without the “hacky” nature of the microformats. There is a process for adding new features to the spec.

Others will want to ignore caveats that HTML 5 places on using the data-* attributes for publicly-usable metadata and use them anyway.

Perhaps most likely, microformats will continue to use class=, rel=, and the like as they do now. That would be valid HTML 5 and require no changes to specs or parsers.

So it seems that microformats will continue in an HTML 5 world. And, now that there seems to be a will to fix the accessibility problems, I think that’s a good thing.