Bruce Lawson's personal site

Structured data and Google

Domain-specific markup for fun and profit

It didn’t come as a surprise to Dull Old Web Farts (DOWFs) like me to learn last month that Google gives a search boost to sites that use structured data (as well as rewarding sites for being performant and mobile-friendly). Google has brilliant heuristics for analysing the content of sites, but when developers are explicit and mark up their content using subject-specific vocabularies, the results are more robust.

For the first time (to my knowledge), Google has published some numbers on how structured data affects business. The headlines:

Impressive, indeed. So how do you do it? For this site, I chose a vocabulary from schema.org:

These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to markup their web pages and email messages. Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.

Because this is a blog, I chose the BlogPosting schema, and I use the HTML5 microdata syntax. So each article is marked up like this:

<article itemscope itemtype="http://schema.org/BlogPosting">
  <header>
    <h2 itemprop="headline" id="post-11378">The HTML Treasure Hunt</h2>
    <time itemprop="dateCreated pubdate datePublished"
      datetime="2019-05-20">Monday 20 May 2019</time>
  </header>
  ...
</article>

The values for the microdata attributes are specified in the schema vocabulary, except the pubdate value on itemprop which isn’t from schema.org, but is required by Apple for WatchOS because, well, Apple likes to be different.

And that’s basically it. All of this, of course, is taken care of by one WordPress template, so it’s automatic.

Metadata partial copy-paste necrosis for misery and loss

One thing puzzles me, however: Google’s documentation says that Google Search supports structured data in any of three formats (JSON-LD, RDFa and microdata), but notes “Google recommends using JSON-LD for structured data whenever possible”.

However, no reason is given for preferring JSON-LD except “Google can read JSON-LD data when it is dynamically injected into the page’s contents, such as by JavaScript code or embedded widgets in your content management system”. I guess this could be an advantage, but one of the other “features” of JSON-LD is, in my opinion, a bug:

The markup is not interleaved with the user-visible text
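
To make that concrete, here is a minimal sketch of how the BlogPosting markup shown earlier might look as JSON-LD. It is an illustration rather than anything this site actually ships: the property names are carried over from the microdata example, and pubdate is left out because it isn’t a schema.org property.

```html
<!-- Illustrative sketch only: the metadata sits in its own block... -->
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "The HTML Treasure Hunt",
  "dateCreated": "2019-05-20",
  "datePublished": "2019-05-20"
}
</script>

<!-- ...while the user-visible markup carries no itemprop attributes at all -->
<article>
  <header>
    <h2 id="post-11378">The HTML Treasure Hunt</h2>
    <time datetime="2019-05-20">Monday 20 May 2019</time>
  </header>
  ...
</article>
```

The headline and date now live in two places, and nothing ties the copy in the script block to the copy a reader sees.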

I strongly feel that metadata that is separated from the user-visible data associated with it is highly susceptible to metadata partial copy-paste necrosis. User-visible text is also developer-visible text. When devs copy/paste that, it’s very easy to forget to copy any associated metadata that’s not interleaved, leading to errors. (And Google will penalise errors: structured data will not show up in search results if “The structured data is not representative of the main content of the page, or is potentially misleading”.)

An example of metadata partial copy-paste necrosis can be seen in the commonly-recommended accessible form pattern:

<label for="my-input">Your name:</label>
<input id="my-input"/>
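
A hypothetical sketch of how that pattern decays (the second field and its my-email id are invented for illustration): the label and input are pasted to make a new field, the visible text and the id get updated, and the for attribute is forgotten.

```html
<!-- Original field: for and id match -->
<label for="my-input">Your name:</label>
<input id="my-input"/>

<!-- Hypothetical pasted copy: text and id were changed, for was not,
     so this label still points at the first input -->
<label for="my-input">Your email:</label>
<input id="my-email"/>
```

The page looks the same, but the second label is no longer associated with its own control.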

As Thomas Caspars wrote

I’ve contacted chums in Google to ask why JSON-LD is preferred, but had no reply. (I may go as far as trying to “reach out” next time.)

Andrew wrote

I’m pretty sure Google prefers JSON-LD over microdata because it’s easier for them to <del>steal</del>borrow the data for their own use in that format. When I was working on a screen-scraping project a few years ago, I found that to be the case. Since then, I’ve come to believe that schema.org is really about making it easier for the big guys to profit from data collection instead of helping site owners improve their SEO. But I’m probably just being a conspiracy theorist.

Speculation and conspiracy theories aside, until there’s a clear reason why I should use JSON-LD over interleaved microdata, I’m keeping it as it is.

Google replies

Updated 23 May: Dan Brickley, a Google employee who is Lord of Schema.org, wrote this thread on Twitter:

5 Responses to “Structured data and Google”

Comment by Eric A. Meyer

We ended up using JSON-LD for the An Event Apart web site because everything is being spat out of a CMS, so it was easier to output the structured data in separate blocks than try to interleave it in the markup. In more manually-maintained scenarios, I’d very likely go the interleaving route for the reasons you describe.

Probably unrelated: your RSS feed is only including the article paths instead of absolute URLs, like this: <link>/2019/the-html-treasure-hunt/</link>. Which confuses my RSS reader (NetNewsWire 3!) to the point that it passes just “/2019/the-html-treasure-hunt/” to the browser, which understandably doesn’t know what to do next.

Comment by Charlie

“Microdata” solves nothing and introduces its own problems; in my opinion it reintroduces some of the problems that HTML5 set out to solve through EAV (Entity Attribute Values).

In contrast, JSON-LD can be validated, loaded directly into a database and queried immediately, while microdata will have to be detected and harvested from the DOM, presumably as JSON, before any processing can happen.

Anything that sticks meaningful data in HTML attributes that are invisible in the browser is inviting necrosis, not least because there is no validation step. I still can’t get over the fact that we never got CSS/locale support for time tags.

Comment by jamrok

So will I be penalized for saying the following:

JSON-LD: the X information is on the page
HTML: show the X information on :hover in CSS dropdown menu

The JSON-LD-specified information is visible to the user only after user interaction, so it is possible for Google Bot to “discover” you’re lying.

On the other hand I could do:
JSON-LD: the X information is on the page
HTML: show the X information, transparency: 95%

and this way I could cheat SEO.

Don’t Repeat Yourself seems more robust here.

Comment by James Nash

FWIW, I prefer the interleaved nature of microdata or RDFa (and between those two I prefer RDFa) to the separated nature of JSON-LD. To my mind it’s like an extra enhancement on top of my semantic HTML so it feels logical to have them close to each other in the code. 🙂

That being said, I can appreciate some of the reasons others like JSON-LD.

I believe all have their uses and should be supported and I hope Google continue to do so. So, it does concern me a tad when Google expresses a preference for one of them – given their history of discontinuing products and services…

Comment by Jason

I’m curious, what about Microformats? Why does it seem like this open standard is never considered for this type of application? It is in the same vein of structured data markup options, isn’t it? I’ve never understood why it has always been left out of the discussion. I’m curious to know if anyone can explain why.
