microdata help, please

I’m trying to wrap my little bonce around HTML5 microdata, not least because Opera 12 pre-alpha has support for it.

I’m quite discouraged because the two articles I’ve read tell me that it’s easy, but I’m still stuck (although they are by noted brain-boxes Oli Studholme and Tab Atkins).

I’m befuddled over itemid:

Sometimes, an item gives information about a topic that has a global identifier. For example, books can be identified by their ISBN number.

Vocabularies (as identified by the itemtype attribute) can be designed such that items get associated with their global identifier in an unambiguous way by expressing the global identifiers as URLs given in an itemid attribute.

The exact meaning of the URLs given in itemid attributes depends on the vocabulary used.

What actually does this mean? How do I know if a particular vocabulary supports global identifiers for items?

In the spec, some Microdata vocabularies are listed. vCard is one, and the spec says “This vocabulary supports global identifiers for items.” The URL defining the itemtype for vCard doesn’t seem to tell me, and the examples in the spec make no use of itemid.

And, because I understand real examples rather than the theoretical, what’s the practical benefit of itemid?

Specifically, what would I gain by using


<div itemscope itemtype="http://vocab.example.net/book" itemid="urn:isbn:0867193719">
Rebekah Brooks' Self-portraits (ISBN 0867193719)
</div>

[example simplified from an example in the spec which says “The “http://vocab.example.net/book” vocabulary in this example would define that the itemid attribute takes a urn: URL pointing to the ISBN of the book.”]

over Schema.org’s Book schema (which doesn’t seem to use itemid – in fact, schema.org seems to make no mention of it):


<div itemscope itemtype="http://schema.org/Book">
Rebekah Brooks' Self-portraits (ISBN <span itemprop="isbn">0867193719</span>)
</div>

Double points if you can answer the question without baffling me with mentions of SPARQLy OWLs and Don’tologies.

25 Responses to “microdata help, please”

Comment by karl

No benefits for the first one over the second one, as they are not really equivalent.

The first one says there’s an item which is a book (if vocab.example.net was defined), identified by the URI “urn:isbn:0867193719”, with some undefined information about it: “Rebekah Brooks’ Self-portraits (ISBN 0867193719)”.

The second one says that in the current page there is a book with a bit of undefined information, “Rebekah Brooks’ Self-portraits (ISBN”, and that its ISBN is 0867193719.

The first one which has a global identifier makes it possible to mash it up with other Web pages in the world containing this unique identifier. (but I will not cite any SPARQLy thing ;) ) The information given though is not very useful because we do not know if there is a title or anything. :) it is just free text, it could be a comment or anything (for a machine)

The second one has information which is “true” for this page and basically says: there is a book with this ISBN. A machine might do the work of collecting the ISBN, though I’m not sure what they do with spaces and stuff. :) “ 0867193719”, “0867193719 ”, “0-867-19371-9”, etc. Schema.org doesn’t define how to write and process the values to make them unique, so the information can’t reliably be compiled across the Web.
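To make the whitespace and hyphen problem concrete, here is a minimal Python sketch of the kind of normalization a consumer would have to invent for itself. The function name `isbn_to_urn` is hypothetical, and the rules (strip whitespace and hyphens, accept 10 or 13 characters, no checksum validation) are assumptions for illustration, not anything schema.org or the microdata spec defines.

```python
import re


def isbn_to_urn(raw):
    """Turn a loosely formatted ISBN string into a urn:isbn: identifier.

    Assumption: only whitespace and hyphens are stripped, and the check
    digit is NOT verified. Returns None if the result does not look like
    an ISBN-10 or ISBN-13.
    """
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not re.fullmatch(r"\d{9}[\dX]|\d{13}", compact):
        return None
    return "urn:isbn:" + compact
```

With this, the three spellings karl lists all collapse to the same identifier, which is the property an itemid gives you up front.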

Comment by Philip Jägenstedt

A good rule of thumb here would be: If you don’t know if you need it, then you don’t need it.

The basic idea is that if you find two microdata items on different pages with the same itemid, they’re the “same.” This might be nice if you’re crawling the web for yummy items and want to avoid duplicates, but of little utility otherwise.

The schema.org people said “We strongly encourage the use of itemids” when asked, but haven’t documented what it actually does or why it is encouraged.
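The duplicate-avoidance idea could be sketched roughly as follows, using only Python's standard `html.parser`. This is not a full microdata parser: it only looks for elements that carry both `itemscope` and `itemid`, and the names `ItemIdCollector`, `collect_itemids` and `new_items` are hypothetical.

```python
from html.parser import HTMLParser


class ItemIdCollector(HTMLParser):
    """Records the itemid of every element that has itemscope and itemid."""

    def __init__(self):
        super().__init__()
        self.itemids = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # valueless attributes such as itemscope map to None
        if "itemscope" in attr_map and attr_map.get("itemid"):
            self.itemids.append(attr_map["itemid"])


def collect_itemids(html):
    """Return the itemid values found in one HTML string, in document order."""
    collector = ItemIdCollector()
    collector.feed(html)
    return collector.itemids


def new_items(pages):
    """Yield (page_url, itemid) only the first time each itemid is seen.

    `pages` maps a page URL to its HTML source; fetching is out of scope.
    """
    seen = set()
    for page_url, html in pages.items():
        for itemid in collect_itemids(html):
            if itemid not in seen:
                seen.add(itemid)
                yield page_url, itemid
```

Items without an itemid are simply ignored here, because there is nothing to compare them by.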

Comment by Lin Clark

I agree with Philip, if you don’t know if you need it, then you don’t need it.

However, I’d disagree that itemids are of little utility. They make it much easier to combine data from different sites.

One problem that we ran into when I was a Web developer at a University was that each department was really ornery and wanted to use their own system for maintaining their Web site. However, we needed to get particular bits of information from each site (for example, contact info or publications).

All of the systems that were in use were incompatible and couldn’t easily share information… some were pure, Contribute-maintained HTML. However, they all could be altered to output something like microdata in a systematic way… at which point the sites could just be parsed for the relevant bits. The itemids would allow the main site to combine information about Prof. Foobar from different dept. sites.

In such a company, changing HTML markup slightly is a much easier organizational change and argument to make than standardizing on the same backend or requiring something like content negotiation for exposing data.
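As a rough illustration of the combining step, assume each department page has already been parsed into plain dictionaries of the shape `{"itemid": ..., "properties": {name: [values]}}`. That shape, the function name `merge_by_itemid`, and any example URL such as `http://university.example/people/foobar` are assumptions made for this sketch, not part of any microdata API.

```python
def merge_by_itemid(items):
    """Combine the properties of items that share the same itemid.

    Items without an itemid are skipped, since nothing says whether two
    of them describe the same thing.
    """
    merged = {}
    for item in items:
        itemid = item.get("itemid")
        if itemid is None:
            continue
        target = merged.setdefault(itemid, {})
        for name, values in item.get("properties", {}).items():
            bucket = target.setdefault(name, [])
            for value in values:
                if value not in bucket:  # keep each distinct value once
                    bucket.append(value)
    return merged
```

The main site would feed in the items collected from every department and read back one combined record per person.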

Comment by c_alpha

karl already gave a nice explanation of part of what I understand itemid’s purpose to be.

As I read the spec, there are two facets:

Elements with an itemscope attribute and an itemtype attribute that references a vocabulary that is defined to support global identifiers for items may also have an itemid attribute specified, to give a global identifier for the item, so that it can be related to other items on pages elsewhere on the Web.
[…]
The global identifier of an item is the value of its element’s itemid attribute, if it has one, resolved relative to the element on which the attribute is specified. If the itemid attribute is missing or if resolving it fails, it is said to have no global identifier.

So it seems it is intended to refer to itemids from anywhere. The text speaks about resolving them, i.e. they are URIs, URNs or URLs.
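The “resolved relative to the element” step can be approximated with Python's `urllib.parse.urljoin`. This sketch assumes the element's base URL is already known and passed in as `base_url`; the function name `global_identifier` is hypothetical, and the spec's exact failure conditions are not modelled.

```python
from urllib.parse import urljoin


def global_identifier(itemid, base_url):
    """Resolve an itemid attribute value against the element's base URL.

    Returns None when the attribute is missing, mirroring the spec's
    "no global identifier" case. Absolute values such as urn: URLs come
    back unchanged.
    """
    if itemid is None:
        return None
    return urljoin(base_url, itemid.strip())
```

So a relative value like `#song` on a page at `http://nyan.cat/animation` and an absolute `urn:isbn:` value are both turned into a full identifier that can be compared across pages.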

This first use of the itemid touches on an unsolved issue: how to resolve – well – anything basically. For instance the ISBN URN you quote. This is a basic problem of Hyperdata (some call it Semantic Web). In your ISBN example, we would need to know how to establish a mapping between ISBN classification schemes. TV-Anytime and DVB are for example working on enabling such mappings. But overall we still have a long way to go before such information can be inferred automatically.

And then there’s itemref:

Properties that are not descendants of the element with the itemscope attribute can be associated with the item using the itemref attribute. This attribute takes a list of IDs of elements to crawl in addition to crawling the children of the element with the itemscope attribute.

So a second use appears to be to group different items together by establishing a linkage between them. itemref gives you a directed graph, but whoever knows what happens if the mesh has loops.

In all it seems to me that the itemid enables you to “go up” from the “plane” of microdata items in your document, into a “third dimension” that links to other items – anywhere on the Web.

I agree with you that the idea is not fully cooked (apart from itemref-ing items in the same document). But at least it is a hook where Semantic Web tools will be able to hook into.

Comment by Philip Jägenstedt

“itemref gives you a directed graph, but whoever knows what happens if the mesh has loops.”

Fortunately, the spec knows ;) Validators will complain about such loops (http://bugzilla.validator.nu/show_bug.cgi?id=850) but they will be visible when traversing using the DOM API.

If you actually want loops, then itemid is your best bet. Instead of a direct link, at the leaf of one item you’d have a URL which is the itemid of another. Just requires a little bit of glue.
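For anyone curious what a check for itemref loops might look like, here is a small depth-first search in Python. It is a simplification: the graph is assumed to be a dictionary mapping each item's name to the names it pulls in via itemref (or nesting), and `find_itemref_loop` is a hypothetical name, not how validator.nu actually implements it.

```python
def find_itemref_loop(graph):
    """Return the names forming one loop (first name repeated at the end), or None.

    `graph` maps a node name to a list of names it references.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for target in graph.get(node, ()):
            if state.get(target, UNSEEN) == IN_PROGRESS:
                # target is already on the current path: that is a loop
                return path[path.index(target):] + [target]
            if state.get(target, UNSEEN) == UNSEEN:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(graph):
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None
```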

Comment by Lin Clark

In general, the idea that similar technologies use is that you only create a *new* URL if you own the domain.

For example, for its employees, the University of Pittsburgh could use a URL like http://pitt.edu/team/jane-doe and Carnegie Mellon might use http://cmu.edu/person/joe-smith. The individual institutions would be in charge of making sure there weren’t collisions within their domain.

There is actually a high chance of collision with the URL structure I’ve shown, which is why things like Google+ use really large numbers instead of names. I’ve just used names for demonstration purposes.

Each organization decides their naming structure. If a consumer is unsure whether the ID relates to the thing they want to talk about, then they would ideally be able to look up that URL in a browser and end up at a page that has a description of that thing (i.e. a bio and picture).

For example, if I were working on a project across different universities, I could look up http://cmu.edu/person/joe-smith to ensure it is really the Joe Smith that I’m collaborating with by looking at the picture on the page. If it is, then I use that as my itemid.

Comment by bruce

Thanks Lin

so: is it fair to say that the benefits are (currently) more theoretical than real? (Nothing wrong if that’s the case, I hasten to add: I preach using elements like nav, header and footer which – as far as I know – have no real benefit at the moment.)

Do you have any insight into why schema.org doesn’t seem to use itemid, even though it would seem to be advantageous for search engines to be able to spot the same item/ person/ event etc across websites?

Comment by Lin Clark

I wouldn’t call it theoretical necessarily. Depends on your definition… if there was someone with dev time to dedicate and who knew what they were doing with microdata parsing, I’d say it’s within a week of being real.

There is a little bit of glue code that needs to happen between parsing the microdata and putting it into a really simple database. In Drupal, this would just mean adding one PHP class to a library that we use. It already has a parser for RDFa and for microformats, so I don’t imagine it would be that much of a long shot to write a parser for microdata.

Once that parser is added, I could do this today using tools in Drupal. I already have integration between the simple database and Drupal’s Views module, which is the most used module outside of Drupal core. I’d be able to point to the microdata pages that I want to use as my sources and display the results on my page. I’d also have a ton of display options (graphing, jQuery animation, etc) because once it gets to Views, it’s just like the data that’s been pulled from the site’s own database, which means it can be displayed with the hundreds of plugins that the community is actively developing. And to be clear, this doesn’t require writing your own query on the data. Views takes care of writing the query for you.

In my example, this would mean you could point to http://cmu.edu/person/joe-smith and whenever a new publication was added to that page, it would also show up on your page. You might have a jQuery slideshow that flips between publications from all the collaborating people from Pitt and CMU and other universities.

As Philip said, if you don’t have a use case where you want to share data like this, then it doesn’t make sense to use itemid. But if you do have such a use case, then itemid is key.

RE: schema.org… In the thread that Philip posted, one of the schema.org folks clarified that they just haven’t documented it yet. I think that they will probably first want to work on getting the basics adopted. IIRC, for the first month or two the tool support wasn’t even there for checking if the results showed up in Google Rich Snippets, so schema.org is not a finished product yet, which they try to make clear in their FAQ.

Comment by Manu Sporny

Hi Bruce,

Following up my response to you on Twitter:

http://twitter.com/#!/manusporny/status/96652822396940288

Let me try again – the problem isn’t you, it’s that we’re not doing a good job at explaining why it’s beneficial to give semantic things globally unique IDs on the Web.

Typically, when we start talking about Web pages, one of the first things you would ask me for is a link to the Web page, like: http://nyan.cat/ . That is, you ask me for something that you can use to uniquely identify that page to a web browser – the URL. We can then start talking about it, knowing that both of us are talking about the same thing:

“Bruce, have you seen http://nyan.cat/ ? I believe that song carries a message of deep cultural significance for all people of the world!”

Microdata, like RDFa and Microformats, is used to describe “things” on the Web. These things can be events, recipes, places, web pages, etc. We need to be able to identify these “things” on the Web just like we identify pages on the Web. @itemid (in Microdata) and @about (in RDFa) allow us to do this in a globally unique way. It allows us to assign a URL to the thing we’re talking about. So, if I name a “thing” on the web like this:

itemid="http://nyan.cat/animation#song"

I can start making statements about that URL using Microdata or RDFa. If you use @itemid in your page to refer to the same URL, you can start making statements about that URL using Microdata or RDFa. If we’re using the same URL in @itemid or @about, we’re making statements about the same “thing” on the Web. We can talk about the same thing across Web pages – and that’s part of the foundation of this whole semantic web thing. The ability to make statements about anything we can give a URL to on the Web.

Now, we can also choose not to use @itemid – and so the “thing” we’re describing remains this amorphous, unnamed thing. We can still make statements about it – but there is no certainty on whether or not what I’m talking about and what you’re talking about are the same thing.

Also, I find your goat.se hamster with the pancake on its head at the bottom of this page incredibly disturbing.

In Microdata:

<div itemscope itemtype="http://schema.org/ImageObject"
   itemid="http://www.brucelawson.co.uk/wp-content/themes/HTML5/images/oolongse.gif">
   <span itemprop="description">A disturbing image of a hamster</span>
</div>

or in RDFa:

<span about="http://www.brucelawson.co.uk/wp-content/themes/HTML5/images/oolongse.gif"
    vocab="http://schema.org/" typeof="ImageObject"
    property="description">A disturbing image of a hamster</span>

Comment by Bruce

OOh, ooh – it might be a lightbulb going off above my head…

and you knew that in that vocabulary you were using (http://schema.org/) itemid was allowed, because it had a global identifier (because it’s based on “thing” and thing has a URL property which can uniquely identify it).

Comment by Philip Jägenstedt

It seems like itemid would often (always?) be redundant with schema.org’s “url” so in that case it doesn’t make a lot of sense.

It also seems extremely unlikely that enough people talking about the same thing would pick the same URL, so heuristics will still be necessary if “sameness” is important. You say itemid="http://en.wikipedia.org/wiki/Sweden", I say itemid="http://sv.wikipedia.org/wiki/Sverige" and so on…

Comment by Manu Sporny

You’re reading too much into it at this point, Bruce – it’s simpler than that :).

@itemid and @about have nothing to do with the Web vocabularies that are used with them. They are a syntax thing – part of the language (Microdata/RDFa). They are part of the language just like the keyword ‘function’ is a part of JavaScript. It doesn’t matter what libraries (like jQuery) you use in your JavaScript program, ‘function’ will always exist as a part of the JS syntax.

Similarly, it doesn’t matter what Web vocabulary you’re using with RDFa or Microdata. You can always use @itemid or @about to be specific about the thing you’re talking about.

Comment by bruce

Manu, you said ” it doesn’t matter what Web vocabulary you’re using with RDFa or Microdata. You can always use @itemid or @about to be specific about the thing you’re talking about.”

But the microdata spec says only some things can have itemid attributes (my italics):

“Elements with an itemscope attribute and an itemtype attribute that references a vocabulary that is defined to support global identifiers for items may also have an itemid attribute specified”

which is the basis of my original questions: How do I know if a particular vocabulary supports global identifiers for items?

Comment by Andy Mabbett

Well, I knew all that. I blame Bruce for a badly-phrased question! ;-)

Seriously, resolving bug 13452 should help make the spec clearer, with follow-up from book authors and bloggers like, er, Bruce giving plain-English explanations and tutorials.

Comment by Philip Jägenstedt

You know if a vocabulary supports global identifiers if it explicitly says so somewhere in its documentation. That’s the case for vCard and vEvent, but not http://n.whatwg.org/work or schema.org (although they say they’re going to update the documentation).

A practical example of where itemid could be useful is in MusicBrainz. Let’s assume they were using microdata (they’re actually publishing RDFa) to mark up Queen’s album Hot Space. As you can tell from that page, the artist Queen is both the artist of the album itself, and one of the artists of Under Pressure.

Rather than just use <span itemprop=artist>Queen</span> for all of these, they could use <span itemprop=artist itemscope itemid="http://musicbrainz.org/artist/0383dadf-2a4e-4d10-a46a-e9e041da8eb3">Queen</span>. That way, there’s no ambiguity over which artist it is.

Or… you could just use MusicBrainz’ (excellent) XML API instead and treat the HTML as the presentation layer it is in this case.

Comment by Bruce

But I could mark up my albums using a different vocabulary, which used itemid="http://en.wikipedia.org/wiki/Queen_(band)" as the identifier for Queen, and any Marvellous Semantic Mash-up-o-Matic device wouldn’t be able to know that your Queen and my Queen were the same.

(Can I just point out that this is an example and I do not own any Queen albums. Thank you)

Comment by Lin Clark

I think the problem is that too much focus has been on the automated part of these technologies, using logic to automate things. I don’t see this as the way the web is developing.

I think the microdata spec really gets it right when it talks about collaborating authors. If one group of people agrees that they are using MusicBrainz URLs to share information, then it doesn’t matter that another group of people use Wikipedia URLs. Each group still gets to do really cool things by sharing information within the group. Just like having a Facebook id makes it easy for content to be related to me in the Facebook network and having a G+ id makes it easy for content to be related to me in the G+ network.

There is definitely a social aspect here, I don’t buy the idea that once we have these tools in place the machines will do everything for us. It’s going to be groups of people agreeing to use a certain site’s IDs.
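One way to picture such an agreement is as a small, hand-maintained alias table rather than anything inferred by a machine. The sketch below is purely illustrative: `ALIASES`, `canonical_id` and `same_thing` are hypothetical names, and the single entry reuses the two Queen URLs mentioned earlier in this thread.

```python
# A group's agreement, written down: "when you see this URL, we mean that one".
ALIASES = {
    "http://en.wikipedia.org/wiki/Queen_(band)":
        "http://musicbrainz.org/artist/0383dadf-2a4e-4d10-a46a-e9e041da8eb3",
}


def canonical_id(itemid, aliases=ALIASES):
    """Map an itemid to the group's agreed identifier, if one is listed."""
    return aliases.get(itemid, itemid)


def same_thing(first, second, aliases=ALIASES):
    """True when two itemids refer to the same agreed identifier."""
    return canonical_id(first, aliases) == canonical_id(second, aliases)
```

Anything not listed in the table is only ever equal to itself, which matches the point that the linking comes from people agreeing, not from the tooling.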

Comment by Yuri Petusko

Imagine marking up a song that has the same artist name and track name as some other song. Then an itemid pointing to MusicBrainz will help.

Comment by c_alpha

Lin Clark has a good point here:

There is definitely a social aspect here, I don’t buy the idea that once we have these tools in place the machines will do everything for us. It’s going to be groups of people agreeing to use a certain site’s IDs.

That is just utterly true. Yet another instance of such agreements can be public, open specifications. These enable everyone to join such a group easily because the agreements are documented in the spec. But it doesn’t buy you out of making such agreements. It just eases it. So in that sense (using public specs), a good part could be made to feel as if the machines were doing everything for us.

This is a basic finding about metadata/hyperdata that I keep making again and again in my work: it’s the agreements enabling links between metadata that create the added value beyond the entity description alone.