Archive for the 'accessibility web standards' Category

Making accessible tagged PDFs with Prince

Love them or hate them, PDFs are a fact of life for many organisations. If you produce PDFs, you should make them accessible to people with disabilities. With Prince, (twitter) it’s easy to produce accessible, tagged PDFs from semantic HTML, CSS and SVG.

It’s an enduring myth that PDF is an inaccessible format. In 2012, the PDF profile PDF/UA (for ‘Universal Accessibility’) was standardised. It’s the U.S. Library of Congress’ preferred format for page-oriented content and the International Standard for accessible PDF technology, ISO 14289.

Let’s look at how to make accessible PDFs with Prince. Even if you already have Prince installed, grab the latest build (think of it as a stable beta for the next version) and install it; it’s a free license for non-commercial use. Prince is available for Windows, Mac, Linux, Free BSD desktops and wrappers are available for Java, C#/ .NET, ActiveX/COM, PHP, Ruby on Rails and Node/ JavaScript for integrating Prince into websites and applications.

Here’s a trivial HTML file, which I’ve called prince1.html.

<!DOCTYPE html>
<html>
<meta charset=utf-8>
<title>My lovely PDF</title>
<style>
        h1 {color:red;}
        p {color:green;}
</style>
<h1>Lovely heading</h1>
<p>Marvellous paragraph!</p>
</html>

From the command line, type

$ prince prince1.html

Prince has produced prince1.pdf in the same folder. (There are many command line switches to choose the name of the output file, combine files into a single PDF etc., but that’s not relevant here. Windows fans can also use a GUI.)

Using Adobe Acrobat Pro, I can inspect the tag structure of the PDF produced:

Acrobat screenshot: no tags available

As you can see, Acrobat reports “No Tags available”. This is because it’s perfectly legitimate to make inaccessible PDFs – documents intended only for printing, for example. So let’s tell Prince to make a tagged PDF:

$ prince prince1.html --tagged-pdf

Inspecting this file in Acrobat shows the tag structure:

Acrobat screenshot showing tags

Now we can see that under the <Document> tag (PDF’s equivalent of a <body> element), we have an <H1> and a <P>. Yes, PDF tags often —but not always— have the same name as their HTML counterparts. As Adobe says

PDF tags are similar to tags used in HTML to make Web pages more accessible. The World Wide Web Consortium (W3C) did pioneering work with HTML tags to incorporate the document structure that was needed for accessibility as the HTML standard evolved.

However, the fact that the PDF now has structural tags doesn’t mean it’s accessible. Let’s try making a PDF with the PDF-UA profile:

$ prince prince1.html --pdf-profile="PDF/UA-1"

Prince aborts, giving the error “prince: error: PDF/UA-1 requires language specification”. This is because our HTML page is missing the lang attribute on the HTML element, which tells assistive technologies which language the text is written in. This is very important to screen reader users, for example; the pronunciation of the word “six” is very different in English and French.

Unfortunately, this is a very common error on the Web; WebAIM recently analysed the accessibility of the top 1,000,000 home pages and discovered that a whopping 97.8% of home pages had detectable accessibility failures. A missing language specification was the fifth most common error, affecting 33% of sites.

screenshot from webaim showing most common accessibility errors on top million homepages
Image courtesy of webaim.org, © WebAIM, used by kind permission

Let’s fix our web page by amending the HTML element to read <html lang=en>.

Now it princifies without errors. Inspecting it in Acrobat Pro, we see a new <Annot> tag has appeared. Right-clicking on it in the tag inspector reveals it to be the small Prince logo image (that all free licenses generate), with alternate text “This document was created with Prince, a great way of getting web content onto paper”:

Acrobat screenshot with annotation on the Prince logo added with free licenses

This generation of the <Annot> with alternate text, and checking that the document’s language is specified allows us to produce a fully-accessible PDF, which is why we generally advise using the --pdf-profile="PDF/UA-1" command line switch rather than --tagged-pdf.

Adobe maintains a list of Standard PDF tags, most of which can easily be mapped by Prince to HTML counterparts.

Customising Prince’s default mappings

Prince can’t always map HTML directly to PDF tags. This could be because there isn’t a direct counterpart in HTML, or it could be because the source markup has conflicting markup and styling.

Let’s look at the first scenario. HTML has a <main> element, which doesn’t have a one-to-one correspondence with a single PDF tag. On many sites, there is one article per document (a wikipedia entry, for example), and it’s wrapped by a <main> element, or some other element serving to wrap the main content.

Let’s look at the wikipedia article for stegosaurus, because it is the best dinosaur.

We can see from browser developer tools that this article’s content is wrapped with <div id=”bodyContent”>. We can tell Prince to map this to the PDF <Art> tag, defined as “Article element. A self-contained body of text considered to be a single narrative” by adding a declaration in our stylesheet:

#bodyContent { prince-pdf-tag-type: Art; }

On another site, we might want to map the <main> element to <Art>. The same method applies:

Main { prince-pdf-tag-type: Art;}

Different authors’ conventions over the years is one reason why Prince can’t necessarily map everything automatically (although, by default HTML <article> gets mapped to <Art>).

Therefore, in this new build of PrinceXML, much of the mapping of HTML elements to PDF tags has been removed from the logic of Prince, and into the default stylesheet html.css in the style sub-folder. This makes it clearer how Prince maps HTML elements to PDF tags, and allows the author to override or customise it if necessary.

Here is the relevant section of the default mappings:

article { prince-pdf-tag-type: Art }
section { prince-pdf-tag-type: Sect }
blockquote { prince-pdf-tag-type: BlockQuote }
h1 { prince-pdf-tag-type: H1 }
h2 { prince-pdf-tag-type: H2 }
h3 { prince-pdf-tag-type: H3 }
h4 { prince-pdf-tag-type: H4 }
h5 { prince-pdf-tag-type: H5 }
h6 { prince-pdf-tag-type: H6 }
ol { prince-pdf-tag-type: OL }
ul { prince-pdf-tag-type: UL }
li { prince-pdf-tag-type: LI }
dl { prince-pdf-tag-type: DL }
dl > div { prince-pdf-tag-type: DL-Div }
dt { prince-pdf-tag-type: DT }
dd { prince-pdf-tag-type: DD }
figure { prince-pdf-tag-type: Div } /* figure grouper */
figcaption { prince-pdf-tag-type: Caption }
p { prince-pdf-tag-type: P }
q { prince-pdf-tag-type: Quote }
code { prince-pdf-tag-type: Code }
img, input[type="image"] {
prince-pdf-tag-type: Figure;
prince-alt-text: attr(alt);
}
abbr, acronym {
prince-expansion-text: attr(title)
}

There are also two new properties, prince-alt-text and prince-expansion-text, which can be overridden to support the relevant ARIA attributes.

Uncle Hakon shouting at me in Paris
Uncle Håkon shouting at me last month in Paris

Taking our lead from wikipedia again, we might want to produce a PDF table of contents from the ‘Contents’ box. Here is the Contents for the entry about otters (which are the best non-dinosaurs):

screenshot of wikipedia's in-page table of contents

The box is wrapped in an unordered list inside a <div id=”toc”>. To make this into a PDF Table of Contents (<TOC>), I add these lines to Prince’s HTML.css (because obviously I can’t touch the wikipedia source files):

#toc ul {prince-pdf-tag-type: TOC;} /*Table of Contents */
#toc li {prince-pdf-tag-type: TOCI;} /* TOC item */

This produces the following tag structure:

Acrobat screenshot showing PDF table of contents based on the wikipedia table of contents

In one of my personal sites, I use HTML <nav> as the wrapper for my internal navigation, so would use these declaration instead:

nav ul {prince-pdf-tag-type: TOC;}
nav li {prince-pdf-tag-type: TOCI;}

Only internal links are appropriate for a PDF Table of Contents, which is why Prince can’t automatically map <nav> to <TOC> but makes it easy for you to do so, either by editing html.css directly, or by pulling in a supplementary stylesheet.

Mapping when semantic and styling conflict

There are a number of tricky questions when it comes to tagging when markup and style conflict. For example, consider this markup which is used to “fake” a bulleted list visually:


<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>My lovely PDF</title>
<style>
div div {display:list-item;
    list-style-type: disc;
    list-style-position: inside;}
</style>
<div>

    <div>One</div>
    <div>Two</div>
    <div>Three</div>

</div>

Browsers render it something like this:

what looks like a bulleted list in a browser

But this merely looks like a bulleted list — it isn’t structurally anything other than three meaningless <div>s. If you need this to be tagged in the output PDF as a list (so a screen reader user can use a keyboard short cut to jump from list to list, for example), you can use these lines of CSS:

body>div {prince-pdf-tag-type: UL;}
div div {prince-pdf-tag-type: LI;}

Prince creates custom OL-L and UL-L tags which are role-mapped to PDF’s list structure tag <L>. Prince also sets the ListNumbering attribute when it can infer it.

Mapping ARIA roles

Often, developers supplement their HTML with ARIA roles. This can be particularly useful when retrofitting legacy markup to be accessible, especially when that markup contains few semantic elements — the usual example is adding role=button to a set of nested <div>s that are styled to look like a button.

Prince does not do anything special with ARIA roles, partly because, as webaim reports,

they are often used to override correct HTML semantics and thus present incorrect information or interactions to screen reader users

But by supplementing Prince’s html.css, an author can map elements with specific ARIA roles to PDF tags. For example, if your webpage has many <div role=”article”> you can map these to pdf <Art> tags thus:

div[role="article"] {prince-pdf-tag-type: Art;}

Conclusion

As with HTML, the more structured and semantic the markup is, the better the output will be. But of course, Prince cannot verify that alternate text is an accurate description of the function of an image. Ultimately claiming that a document meets the PDF/UA-1 profile actually requires some human review, so Prince has to trust that the author has done their part in terms of making the input intelligible. Using Prince, it’s very easy to turn long documents —even whole books— into accessible and attractive PDFs.

Reading List 233

Want my reading lists sent straight to your inbox? Sign up and Mr Mailchimp will send it your way!

A short note on HTML5 article, section and hgroup

This conference season I’ve spoken at some events for non-frontenders, suggesting that people invest time in learning the semantics of HTML. After all, there are only 120(ish) elements; the average two year old knows 100 words and by the time a child is three will have a vocabulary of over 300 words.

A few people asked me the difference between <article> and <section>. My reply: don’t worry. Simply, don’t use <section>. its only use is in the HTML Document Outline Algorithm, which isn’t implemented anywhere, and seemingly never will be. For the same reason, don’t worry about the <hgroup> element.

But do use <article>, and not just for blog posts/ news stories. It’s not just for articles in the news sense, it’s for discrete self-contained things. Think “article of clothing”, not “magazine article”. So a list of videos should have each one (and its description) wrapped in an <article>. A list of products, similarly. Consider adding microdata from schema.org, as that will give you better search engine results and look better on Apple watches.

And, of course, do use <main>, <nav>, <header> and <footer>. It’s really useful for screen reader users – see my article The practical value of semantic HTML.

Happy marking up!

Why would a screen reader user have a braille display?

Last week, I was invited to address the annual conference of the UK Association for Accessible Formats. I found myself sitting next to a man with these two refreshable braille displays, so I asked him what the difference is.

Two similar refreshable braille displays, side by side

On the left is his old VarioUltra 20, which can connect to devices via USB, Bluetooth, and can take a 32MS SD card, for offline use (reading a book, for example). It’s also a note-taker. He told me it cost around £2500. On the right is his new Orbit Reader 20, “the world’s most affordable Refreshable Braille Display” with similar functionality, which costs £500.

As he wasn’t deaf-blind, I asked why he uses such expensive equipment, when devices have built-in free screen readers. One of his reasons was, in retrospect, so blazingly obvious, and so human.

He likes to read his kids bedtime stories. With the braille display, he can read without a synthesised voice in his ear. Therefore, he could do all the characters’ voices himself to entertain his children.

My take-home from this: Of course free screen readers are an enormous boon, but each person has their own reasons for choosing their assistive technologies. Accessibility isn’t a technological problem to be solved. It’s an essential part of the human condition: we all have different needs and abilities.

Reading List 232

Want my reading lists sent straight to your inbox? Sign up and Mr Mailchimp will send it your way!

Reading List 231

Structured data and Google

Domain-specific markup for fun and profit

It doesn’t come as a surprise to Dull Old Web Farts (DOWFs) like me to learn last month that Google gives a search boost to sites that use structured data (as well as rewarding sites for being performant and mobile-friendly). Google has brilliant heuristics for analysing the content of sites, but developers being explicit and marking up their content using subject-specific vocabularies means more robust results.

For the first time (to my knowledge), Google has published some numbers on how structured data affects business. The headlines:

  • Jobrapido’s overall organic traffic grew by 115%, and they have seen a 270% increase in new user registrations from organic traffic
  • After the launch of job posting structured data, Google organic traffic to ZipRecruiter job pages converted at a rate three times higher than organic traffic from other search engines. The Google organic conversion rate on job pages was also more than 4.5 times higher than it had been previously, and the bounce rate for Google visitors to job pages dropped by over 10%.
  • In the month following implementation, Eventbrite saw roughly a 100-percent increase in the typical year-over-year growth of traffic from Google Search
  • Traffic to all Rakuten Recipe pages from search engines soared 2.7 times, and the average session duration was now 1.5 times longer than before.

Impressive, indeed. So how do you do it? For this site, I chose a vocabulary from schema.org:

These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to markup their web pages and email messages. Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.

Because this is a blog, I chose the BlogPosting schema, and I use the HTML5 microdata syntax. So each article is marked up like this:

<article itemscope itemtype="http://schema.org/BlogPosting">
  <header>
  <h2 itemprop="headline" id="post-11378">The HTML Treasure Hunt</h2>
  <time itemprop="dateCreated pubdate datePublished" 
    datetime="2019-05-20">Monday 20 May 2019</time>
  </header>
    ...
</article>

The values for the microdata attributes are specified in the schema vocabulary, except the pubdate value on itemprop which isn’t from schema.org, but is required by Apple for WatchOS because, well, Apple likes to be different.

And that’s basically it. All of this, of course, is taken care of by one WordPress template, so it’s automatic.

Metadata partial copy-paste necrosis for misery and loss

One thing puzzles me, however; Google documentation says that Google Search supports structured data in any of three formats: JSON-LD, RDFa and microdata formats, but notes “Google recommends using JSON-LD for structured data whenever possible”.

However, no reason is given for preferring JSON-LD except “Google can read JSON-LD data when it is dynamically injected into the page’s contents, such as by JavaScript code or embedded widgets in your content management system”. I guess this could be an advantage, but one of the other “features” of JSON-LD is, in my opinion, a bug:

The markup is not interleaved with the user-visible text

I strongly feel that metadata that is separated from the user-visible data associated with it highly susceptible to metadata partial copy-paste necrosis. User-visible text is also developer-visible text. When devs copy/ paste that, it’s very easy to forget to copy any associated metadata that’s not interleaved, leading to errors. (And Google will penalise errors: structured data will not show up in search results if “The structured data is not representative of the main content of the page, or is potentially misleading”.)

An example of metadata partial copy-paste necrosis can be seen in the commonly-recommended accessible form pattern:

<label for="my-input">Your name:</label>
<input id="my-input"/>

As Thomas Caspars wrote

I’ve contacted chums in Google to ask why JSON-LD is preferred, but had no reply. (I may go as far as trying to “reach out” next time.)

Andrew wrote

I’m pretty sure Google prefers JSON-LD over microdata because it’s easier for them to stealborrow the data for their own use in that format. When I was working on a screen-scraping project a few years ago, I found that to be the case. Since then, I’ve come to believe that schema.org is really about making it easier for the big guys to profit from data collection instead of helping site owners improve their SEO. But I’m probably just being a conspiracy theorist.

Speculation and conspiracy theories aside, until there’s a clear reason why I should use JSON-LD over interleaved microdata, I’m keeping it as it is.

Google replies

Updated 23 May: Dan Brickley, a Google employee who is Lord of Schema.org, wrote this thread on Twitter:

The HTML Treasure Hunt

Here are my slides for The HTML Treasure Hunt, my keynote at the International JavaScript Conference last week. They probably don’t make much sense on their own, unfortunately, as I use slides as pointers for me to ramble about the subject, but a video is coming soon, and I’ll post it here.

Update! Here’s the video! Starts at 18:08.

YouTube video

Given that one of my themes was “write less JS and more HTML”, feedback was great! Attendees gave me 4.8 out of 5 for “Quality of the presentation” (against a conference average of 4.0) and 4.9 for “Speaker’s knowledge of the subject” (against an average of 4.5). Comments included:

great talk! reminding of the basics we often forget.

amazing way to start the second day of this conference. inspiring to say the least. great job Bruce

very entertaining and great message. excellent speaker

Thanks, that was a talk a lot of us needed.

Remarkable presentation. Thought provoking, backed with statistics. Well presented.

Very experienced and inspiring speaker. I would really like to incorporate this new ideas (for me) in my code

I think there’s a room full of people going to re-learn HTML after that inspiring talk!

If you’d like me to talk at your event, give training at your organisation, or help CTO your next development project, get in touch!

Reading List 230

Reading List 229