Domain-specific markup for fun and profit
It came as no surprise to Dull Old Web Farts (DOWFs) like me to learn last month that Google gives a search boost to sites that use structured data (as well as rewarding sites for being performant and mobile-friendly). Google has brilliant heuristics for analysing the content of sites, but developers being explicit and marking up their content using subject-specific vocabularies means more robust results.
For the first time (to my knowledge), Google has published some numbers on how structured data affects business. The headlines:
- Jobrapido’s overall organic traffic grew by 115%, and they have seen a 270% increase in new user registrations from organic traffic
- After the launch of job posting structured data, Google organic traffic to ZipRecruiter job pages converted at a rate three times higher than organic traffic from other search engines. The Google organic conversion rate on job pages was also more than 4.5 times higher than it had been previously, and the bounce rate for Google visitors to job pages dropped by over 10%.
- In the month following implementation, Eventbrite saw roughly a 100-percent increase in the typical year-over-year growth of traffic from Google Search
- Traffic to all Rakuten Recipe pages from search engines soared 2.7 times, and the average session duration was now 1.5 times longer than before.
Impressive, indeed. So how do you do it? For this site, I chose a vocabulary from schema.org:
These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to markup their web pages and email messages. Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.
Because this is a blog, I chose the BlogPosting schema, and I use the HTML5 microdata syntax. So each article is marked up like this:
<article itemscope itemtype="http://schema.org/BlogPosting">
  <header>
    <h2 itemprop="headline" id="post-11378">The HTML Treasure Hunt</h2>
    <time itemprop="dateCreated pubdate datePublished" datetime="2019-05-20">Monday 20 May 2019</time>
  </header>
  ...
</article>
The values for the microdata attributes are specified in the schema vocabulary, except the pubdate value on itemprop, which isn’t from schema.org but is required by Apple for WatchOS because, well, Apple likes to be different.
And that’s basically it. All of this, of course, is taken care of by one WordPress template, so it’s automatic.
Metadata partial copy-paste necrosis for misery and loss
One thing puzzles me, however: Google documentation says that Google Search supports structured data in any of three formats (JSON-LD, RDFa and microdata), but notes “Google recommends using JSON-LD for structured data whenever possible”.
The markup is not interleaved with the user-visible text
I strongly feel that metadata that is separated from the user-visible data associated with it is highly susceptible to metadata partial copy-paste necrosis. User-visible text is also developer-visible text. When devs copy/paste that, it’s very easy to forget to copy any associated metadata that’s not interleaved, leading to errors. (And Google will penalise errors: structured data will not show up in search results if “The structured data is not representative of the main content of the page, or is potentially misleading”.)
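To make the contrast concrete, here is a rough sketch of how the same article might look if the metadata were moved into JSON-LD. It is an illustration only, not what this site serves: it assumes the headline and dates from the microdata example, and it leaves out pubdate because that isn’t a schema.org property. Note that the headline and date now live in two places, and nothing ties the two copies together:

```html
<!-- Sketch only: a JSON-LD rendering of the BlogPosting example (assumed values) -->
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "The HTML Treasure Hunt",
  "dateCreated": "2019-05-20",
  "datePublished": "2019-05-20"
}
</script>

<!-- The visible markup carries no metadata of its own -->
<article>
  <header>
    <h2 id="post-11378">The HTML Treasure Hunt</h2>
    <time datetime="2019-05-20">Monday 20 May 2019</time>
  </header>
  ...
</article>
```

Copy the article element into another template without the script block, or edit the visible headline and forget the JSON copy, and the structured data is either lost or no longer matches the page.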
An example of metadata partial copy-paste necrosis can be seen in the commonly-recommended accessible form pattern:
<label for="my-input">Your name:</label> <input id="my-input"/>
As Thomas Caspers wrote:
I’ve lost track how many times I found broken ids, duplicate id/for, ids with two or more values and much more, so I prefer the implicit / wrapped variant.
— Klar Name (@tcaspers) May 5, 2019
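For comparison, the implicit / wrapped variant mentioned in the tweet would look something like this (a sketch, reusing the label text from the example above):

```html
<!-- Implicit association: the input sits inside its label, so there is no for/id pair to keep in sync -->
<label>Your name: <input/></label>
```

Because the association comes from nesting rather than from matching attribute values, copying the label takes the relationship along with it.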
I’ve contacted chums in Google to ask why JSON-LD is preferred, but had no reply. (I may go as far as trying to “reach out” next time.)
I’m pretty sure Google prefers JSON-LD over microdata because it’s easier for them to steal (sorry, borrow) the data for their own use in that format. When I was working on a screen-scraping project a few years ago, I found that to be the case. Since then, I’ve come to believe that schema.org is really about making it easier for the big guys to profit from data collection instead of helping site owners improve their SEO. But I’m probably just being a conspiracy theorist.
Speculation and conspiracy theories aside, until there’s a clear reason why I should use JSON-LD over interleaved microdata, I’m keeping it as it is.
Updated 23 May: Dan Brickley, a Google employee who is Lord of Schema.org, wrote this thread on Twitter:
1st conversations about https://t.co/ooIuC1elTy in JSON came up via Gmail's "smart mail" features, e.g. Flight boarding passes being marked up to show up in Google Now, smart watches etc.
— Dan Brickley (@danbri) May 22, 2019