It doesn’t come as a surprise to Dull Old Web Farts (DOWFs) like me to learn last month that Google gives a search boost to sites that use structured data (as well as rewarding sites for being performant and mobile-friendly). Google has brilliant heuristics for analysing the content of sites, but developers being explicit and marking up their content using subject-specific vocabularies means more robust results.
Jobrapido’s overall organic traffic grew by 115%, and they have seen a 270% increase in new user registrations from organic traffic
After the launch of job posting structured data, Google organic traffic to ZipRecruiter job pages converted at a rate three times higher than organic traffic from other search engines. The Google organic conversion rate on job pages was also more than 4.5 times higher than it had been previously, and the bounce rate for Google visitors to job pages dropped by over 10%.
In the month following implementation, Eventbrite saw roughly a 100-percent increase in the typical year-over-year growth of traffic from Google Search
Traffic to all Rakuten Recipe pages from search engines soared 2.7 times, and the average session duration was now 1.5 times longer than before.
Impressive, indeed. So how do you do it? For this site, I chose a vocabulary from schema.org:
These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to markup their web pages and email messages. Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.
Because this is a blog, I chose the BlogPosting schema, and I use the HTML5 microdata syntax. So each article is marked up like this:
The values for the microdata attributes are specified in the schema vocabulary, except the pubdate value on itemprop which isn’t from schema.org, but is required by Apple for WatchOS because, well, Apple likes to be different.
And that’s basically it. All of this, of course, is taken care of by one WordPress template, so it’s automatic.
Metadata partial copy-paste necrosis for misery and loss
One thing puzzles me, however; Google documentation says that Google Search supports structured data in any of three formats: JSON-LD, RDFa and microdata formats, but notes “Google recommends using JSON-LD for structured data whenever possible”.
The markup is not interleaved with the user-visible text
I strongly feel that metadata that is separated from the user-visible data associated with it highly susceptible to metadata partial copy-paste necrosis. User-visible text is also developer-visible text. When devs copy/ paste that, it’s very easy to forget to copy any associated metadata that’s not interleaved, leading to errors. (And Google will penalise errors: structured data will not show up in search results if “The structured data is not representative of the main content of the page, or is potentially misleading”.)
An example of metadata partial copy-paste necrosis can be seen in the commonly-recommended accessible form pattern:
I’m pretty sure Google prefers JSON-LD over microdata because it’s easier for them to stealborrow the data for their own use in that format. When I was working on a screen-scraping project a few years ago, I found that to be the case. Since then, I’ve come to believe that schema.org is really about making it easier for the big guys to profit from data collection instead of helping site owners improve their SEO. But I’m probably just being a conspiracy theorist.
Speculation and conspiracy theories aside, until there’s a clear reason why I should use JSON-LD over interleaved microdata, I’m keeping it as it is.
Updated 23 May: Dan Brickley, a Google employee who is Lord of Schema.org, wrote this thread on Twitter:
1st conversations about https://t.co/ooIuC1elTy in JSON came up via Gmail's "smart mail" features, e.g. Flight boarding passes being marked up to show up in Google Now, smart watches etc.
As for HTML, there’s not much to learn right away and you can kind of learn as you go, but before making your first templates, know the difference between in-line elements like <span> and how they differ from block ones like <div>. This will save you a huge amount of headache when fiddling with your CSS code.
What is ‘good’ HTML?
In fact, part of what we commonly call ‘HTML5’ is the Parsing Algorithm which is like an HTML ninja – incredibly powerful, yet rarely noticed. It ensures that all modern browsers construct the same DOM from the same HTML, regardless of whether the HTML is well-formed or not. It’s the greatest fillip to interoperability we’ve ever seen.
By ‘good’ HTML, I mean semantic HTML, a posh term for choosing the right HTML element for the content. This isn’t a philosophical exercise; it has directly observable practical benefits.
For example, consider the <button> element. Using this gives you some browser behaviour for free:
A button is focusssable via the keyboard. I bet you, dear reader, know all the keyboard shortcuts for your IDE; it makes development much faster. Many people use only the keyboard when using a webpage. I do it because I have multiple sclerosis – therefore the fine motor control required to use a mouse can be difficult for me. My neighbour has arthritis, so she prefers to use the keyboard.
Buttons can be activated using the space bar or the enter key; you don’t have to remember to listen for these keypresses in your script.
Here’s another example: semantically linking a <label> to its associated <input> increases usability for a mouse-user or touch-screen user, because clicking in the label focusses into the input field. See this in action (and how to do it) on MDN.
This might be me, with my MS; it might be you, on a touch-screen device on a bumpy train, trying to check a checkbox. How much easier it is if the hit area also includes the label “uncheck to opt out of cancelling us not sending you spam forever”. (Compare the first and second identical-looking examples in a checkbox demo.)
But the point is that by choosing the right element for the job, you’re getting browser behaviour for free that makes your app more usable to a range of different people.
Invisible browser behaviours
With me so far? Good. The browser behaviours associated with the semantics of <button> and <label> are obvious once you know about them – because you can see them.
Other semantics aren’t so obvious to a sighted developer with a desktop machine and a nice big monitor, but they are incredibly useful for those who do need them. Let’s look at some of those.
HTML5 has some semantics that you can use for indicating regions on a page. For example, <nav>, <main>, <header>, <footer>.
If you wrap your main content – that is, the stuff that isn’t navigation, logo and main header etc – in a <main> tag, a screen reader user can jump immediately to it using a keyboard shortcut. Imagine how useful that is – they don’t have to listen to all the content before it, or tab through it to get to the main meat of your page.
And for people who don’t use a screenreader, that <main> element doesn’t get in the way. It has no default styling at all, so there’s nothing for you to remove. For those that need it, it simply works; for those that don’t need it, it’s entirely transparent.
Similarly, using <nav> for your primary navigation provides screenreader users with a shortcut key to jump to the navigation so they can continue exploring your marvellous site. You were probably going to wrap your navigation in a <div class=”nav”> to position it and style it; why not choose <nav> instead (it’s shorter!) and make your site more usable to the 15% of the world who have a disability?
Update: here’s a YouTube video of blind screenreader user Leonie Watson talking through how she navigates this site using the HTML semantics we’ve discussed.
New types of devices
We’re seeing more and more types of devices connecting to the web, and semantic HTML can help these devices display your content in a more usable way to their owners. And if your site is more usable than your competitors’, you win, and your boss will erect a massive gold statue of you in the office car park. (Trust me, your boss told me. They’ve already ordered the plinth.)
We’ve brought Reader to watchOS 5 where it automatically activates when following links to text heavy web pages. It’s important to ensure that Reader draws out the key parts of your web page by using semantic markup to reinforce the meaning and purpose of elements in the document. Let’s walk through an example. First, we indicate which parts of the page are the most important by wrapping it in an article tag.
Specifically, enclosing these header elements inside the article ensure that they all appear in Reader. Reader also styles each header element differently depending on the value of its itemprop attribute. Using itemprop, we’re able to ensure that the author, publication date, title, and subheading are prominently featured.
(If you plan to put things into microdata, please note that Apple, being Apple, go their own way, and don’t use a schema.org vocabulary here. Le sigh. See my article Content needs a publication date! for more. Or view source on this page to see how I’m using microdata on this article.)
Apple WatchOS also optimises display of items wrapped in <figure> elements:
Reader recognizes these tags and preserves their semantic styles. For this image, we use figure and figcaption elements to let the Reader know that the image is associated with the below caption. Reader then positions the image alongside its caption.
You probably know that HTML5 greatly increased the number of different <input> types, for example <input type=”email”> on a mobile device shows a keyboard with the “@” symbol and “.” that are in all email addresses; <input type=”tel”> on a mobile device shows a numeric keypad.
On desktop browsers, where you have a physical keyboard, you may get different User Interface benefits, or built-in validation. You don’t need to build any of this; you simply choose the right semantic that best expresses the meaning of your content, and the browser will choose the best display, depending on the device it’s on.
In WatchOS, input types take up the whole watch screen, so choosing the correct one is highly desirable.
First, choose the appropriate type attribute and element tag for your form controls.
WebKit supports a variety of form control types including passwords, numeric and telephone fields, date, time, and select menus. Choosing the most relevant type attribute allows WebKit to present the most appropriate interface to handle user input.
Secondly, it’s important to note that unlike iOS and macOS, input methods on watchOS require full-screen interaction. Label your form controls or specify aria label or placeholder attributes to provide additional context in the status bar when a full-screen input view is presented.
I didn’t choose to use <article> and itemprop and input types because I wanted to support the Apple Watch; I did it before the Apple Watch existed, in order that my code is future-proof. By choosing the right semantics now, a machine that I don’t know about yet can understand my content and display it in the best way for its users. You are opting out of this if you only use <div> or <span> because, by definition, they have “no special meaning at all”.
I hope I’ve shown you that choosing the correct HTML isn’t purely an academic exercise. Perhaps you can’t use all the above (and there’s much more that I haven’t discussed) but when you can, please consider whether there’s an HTML element that you could use to describe parts of your content.
For instance, when building components that emit banners and logos, consider wrapping them in <header> rather than <div>. If your styling relies upon classnames, <header class=”header”> will work just as well as <div class=”header”>.
Semantic HTML will give usability benefits to many users, help to future-proof your work, potentially boost your search engine results, and help people with disabilities use your site.
And, best of all, thinking more about your HTML will stop Dull Old Web Farts like me moaning at you.
What’s not to love? Have a splendid holiday season, whatever you celebrate – and here’s to a semantic 2019!
How to Learn CSS – Rachel Andrew’s on the core concepts of CSS that will help you work with CSS, rather than against it.
A new fight has broken out in specland, between the supporters of RDFa and supporters of microdata. Observers may be wondering why; both are methods of adding extra markup to existing content in order that machines may better understand the content. Semantic Web proponents (note capital letters) dream of a Web where all content is linked by said machines. Semantic Web sceptics have more humble aspirations of search engines better understanding micro-content (is this string of digits a book ISBN, or a phone number?).
RDFa was part of XHTML 2. It became a W3C standard (or, in their vernacular, a “recommendation”) in 2008. microdata was invented by Ian Hickson as part of HTML5 because he identified deficiencies in RDFa. microdata was subsequently modularised out of W3C HTML5, but microdata is part of HTML5; it validates, whereas RDFa doesn’t.
Note the history. Like football fights that break out because one guy called an opposing team fan’s pint “a pouff”, this isn’t about the actual slight at all; this is about the past, allegiances and alliances; it’s a clash of world views. This is XML versus non-XML; it’s the XHTML 2 gang against the uncouth young turks of HTML5. This is Rangers vs Celtic; it’s Blur vs Oasis; it’s Tiswas vs Swap Shop.
What follows is the observation of a layman; I’ve not used much structured content, so am not an expert (I once tried to use microformats for events at the Law Society, but their accessibility problems prevented it.)
In my opinion, the primary deficiencies of Classic RDFa are that it’s too hard to write. For professional metadata-ologists it may be simple (but, hey, those guys understand Dublin Core!). The difficulty for me as an HTML wrangler was namespacing, CURIEs, and triples. This is XML land, and most web authors are not particularly adept with XML.
There’s also the problem that in order to use RDFa properly, you needed an xmlns attribute which is separate from the content you’re actually marking up (you don’t anymore in RDFa 1.1, see Manu’s comment). In a world where lots of content is syndicated via machine, or copy and pasted by authors (many of whom don’t really understand what they’re copy and pasting), this leads to breakage as not all of the necessary moving parts get transferred to their new environment. Hixie wrote
Copy-and-paste of the source becomes very brittle when two separate parts of a document are needed to make sense of the content. Copy-and-paste is how the Web evolved, so I think it is important to keep it functional and easy.
microdata solves this problem. It’s also easier to write than Classic RDFa (in my opinion) although I’m still mystified by the itemid attribute. I intend to start using microdata on this site soon (in order to plug the holes left by removal of the HTML5 pubdate attribute).
I’ve been recommending that people use microdata. Its main advantages:
it’s the basis of schema.org vocabularies which is a consortium of Bing, Google, Yahoo and Yandex, so you get some SEO benefits
Manu Sporny understood the problem that RDFa is hard to author for those of us who find the best ontology is a don’t-ology. Almost a year ago, he set about simplifying RDFa and came up with RDFa Lite. RDFa Lite greatly simplifies RDFa; in fact, you can search and replace microdata terms with RDFa terms (see his post Mythical Differences: RDFa Lite vs. Microdata).
RDFa has multiple advantages, too:
it’s compatible with existing RDFa data on the Web (which is why it uses many of the same patterns as microdata but uses a different syntax)
you can use different vocabularies in the same item, which you can’t in microdata – see Jeni Tennison’s Using Multiple Vocabularies in Microdata. This allows you to support both schema.org and Facebook’s Open Graph Protocol (OGP) using a single markup language
you can easily switch to full-fat RDFa in the future if you feel the need
It also validates as HTML5 (thanks Gunnar, in the comments)
While I completely understand that two competing standards makes it harder for developers in the short term, I agree with Marcos Caceres (who isn’t a WHATWG/ HTML5 zealot) who counters Manu Sporny’s objection to microdata progressing thus:
I hope you will instead focus your energy on convincing the world that RDFa is the “correct technology” on its own merits and not place your bets on a mostly meaningless label (“Recommendation”) given by some (much loved, but) random standard organisation.
I see no technical reason to favour microdata or RDFa Lite; both do the job. So, developers; which tickles your fancies? RDFa Lite or microdata?
Sometimes, an item gives information about a topic that has a global identifier. For example, books can be identified by their ISBN number.
Vocabularies (as identified by the itemtype attribute) can be designed such that items get associated with their global identifier in an unambiguous way by expressing the global identifiers as URLs given in an itemid attribute.
The exact meaning of the URLs given in itemid attributes depends on the vocabulary used.
What actually does this mean? How do I know if a particular vocabulary supports global identifiers for items?
[example simplified from an example in the spec which says “The “http://vocab.example.net/book” vocabulary in this example would define that the itemid attribute takes a urn: URL pointing to the ISBN of the book.”]
over Schema.org’s Book schema (which doesn’t seem to use itemid – in fact, schema.org seems to make no mention of it):
While I was on my holidays there was a storm(ette) about rev=canonical and how it isn’t possible in HTML 5 because rev isn’t in the spec. (Apparently, the answer is to use rel=shortlink instead).
Mark Pilgrim published an article about link relations in HTML 5 with more information about the rel attribute, which I found interesting; I had no idea that relations such as rel=license and rel=author were available to allow auto-discovery of license information, and author details.
So I want to float the idea of rel=accessibility that would allow assistive technologies to discover and offer shortcuts to accessibility information, such as a WCAG 2 conformance claim, or a form to request content in alternate formats (for example).
The reason this would be useful is that links to such pages are generally right down in the footer of the web pages. This means that, for screenreader users, they have to navigate to the end of the page to find the link, or not know it exists.
Ironically, on sites that really do need a link to accessibility help (because of lack of structure to navigate with or huge amounts of content before the footer), those who need it are unlikely to find the link to the help.
In the “bad old days”, helpful developers would give an accesskey attribute to that link (which are generally undiscoverable to the human or to a parser, and which often conflict with assistive technologies’ command keystrokes).
A standardised way of indicating the related accessibility information would be better and not rely on arbitrary keys chosen by a developer.
So, should I propose that rel=accessibility be added to the list of values? It looks to be an arduous process; although you don’t need to prove your worth to the HTML 5 gatekeepers, you do have to prove your worth to the microformats gatekeepers.
I thought I’d ask you guys first—is this a good idea?
One such misuse is encoding dates and times in microformats such as hCalendar, hAtom, and hReview. Ultimately, this problem goes away in HTML 5, as that introduces a time element which is obviously better than an abbreviation for marking up dates and times (a tenet of microformats is to “use the most accurately precise semantic XHTML building block for each object”).
So, an example of an HTML 4 based hCalendar microformat (taken from the spec) is
<a class="url" href="http://www.web2con.com/">http://www.web2con.com/</a>
<span class="summary">Web 2.0 Conference</span>:
<abbr class="dtstart" title="2007-10-05">October 5</abbr>-
<abbr class="dtend" title="2007-10-20">19</abbr>,
at the <span class="location">Argent Hotel, San Francisco, CA</span>
After replacing the abbr element with time and replacing its title attribute with datetime we get
<a class="url" href="http://www.web2con.com/">http://www.web2con.com/</a>
<span class="summary">Web 2.0 Conference</span>:
<time class="dtstart" datetime="2007-10-05">October 5</time>-
<time class="dtend" datetime="2007-10-20">19</time>,
at the <span class="location">Argent Hotel, San Francisco, CA</span>
“Thirty point three oh oh four seven four semi-colon minus ninety-seven point seven four seven two four seven”
A leading microformatter, Ben Ward, recently proposed an extension to the value-excerption pattern (no, I don’t know what that means, either) which allows this machine-readable information to remain in the DOM while hiding machine date from people.
My test page plays nicely with the following screenreaders:
JAWS 9 and JAWS 10 on Firefox 3 and IE 7 with high verbosity settings and abbr and acronym set to always be expanded (says Jared Smith)
WinXP Window-Eyes with FF3 and full punc reads human date. It pauses before the .dtstart spans in Mouse mode, but not in Browse mode (thanks dotjay)
Safari 3.2.1 on OS X Leopard with VoiceOver (thanks dotjay)
WinXP Window-Eyes with IE 6 and full punc reads human date (thanks dotjay)
I’m really excited by this as it may be the end of the microformats versus accessibility debates (that I’ve helped stir up). If you have access to assistive techologies, please give it a test.
Problems with this pattern from a microformats perspective might be
The machine data is “hidden” so might more easily fall out of sync with the human data
The machine data must be the first child of the property. If it isn’t, a parser won’t see it – but it will be trickier to debug because the developer will still see it in the source code
I hope, if screenreader and parser testing allows, that the new pattern will be adopted so that those of us who want to use microformats and care about accessibility can use it.
The future of microformats in HTML 5
I had naively thought that many microformats would use the HTML data-* attributes, which are for “embedding custom non-visible data” and thus seem perfect for embedding such information on structutal markup.
Any element in HTML 5 can have any number of data-* attributes. The asterisk is a wildcard; you can call them what you want. An example from the spec:
User agents must not derive any implementation behavior from these attributes or values. Specifications intended for user agents must not define these attributes to have any meaningful values.
I was uncertain what this meant, so asked Anne van Kesteren who told me that these attributes are for passing data to scripts that are private to the page, rather than to indicate meaning to external parsers:
It’s so that non-private extensions that need User Agent implementation a) won’t break sites using those names for other purposes and b) get due consideration by the Working Group and a proper name without data-
This is because these attributes are intended for use by the site’s own scripts, and are not a generic extension mechanism for publicly-usable metadata.
So, microformats won’t be “rolled up” into HTML 5. I imagine that some of the microformats community will wish to lobby the HTML 5 working group for a proper name for the data they wish to store with an element, so that the data can be parsed reliably without the “hacky” nature of the microformats. There is a process for adding new features to the spec.
Others will want to ignore caveats that HTML 5 places on using the data-* attributes for publicly-usable metadata and use them anyway.
Perhaps most likely, microformats will continue to use class=, rel=, and the like as they do now. That would be valid HTML 5 and require no changes to specs or parsers.
So it seems that microformats will continue in an HTML 5 world. And, now that there seems to be a will to fix the accessibility problems, I think that’s a good thing.