Bruce Lawson's personal site

Reading List 236

Want my reading lists sent straight to your inbox? Sign up and Mr Mailchimp will send it your way.

Reading List 235

Reading List 234

Making accessible tagged PDFs with Prince

Love them or hate them, PDFs are a fact of life for many organisations. If you produce PDFs, you should make them accessible to people with disabilities. With Prince, it’s easy to produce accessible, tagged PDFs from semantic HTML, CSS and SVG.

It’s an enduring myth that PDF is an inaccessible format. In 2012, the PDF profile PDF/UA (for ‘Universal Accessibility’) was standardised. It’s the U.S. Library of Congress’ preferred format for page-oriented content and the International Standard for accessible PDF technology, ISO 14289.

Let’s look at how to make accessible PDFs with Prince. Even if you already have Prince installed, grab the latest build (think of it as a stable beta for the next version) and install it; it’s free for non-commercial use. Prince is available for Windows, Mac, Linux and FreeBSD desktops, and wrappers are available for Java, C#/.NET, ActiveX/COM, PHP, Ruby on Rails and Node/JavaScript for integrating Prince into websites and applications.

Here’s a trivial HTML file, which I’ve called prince1.html.

<!DOCTYPE html>
<html>
<meta charset=utf-8>
<title>My lovely PDF</title>
<style>
        h1 {color:red;}
        p {color:green;}
</style>
<h1>Lovely heading</h1>
<p>Marvellous paragraph!</p>
</html>

From the command line, type

$ prince prince1.html

Prince has produced prince1.pdf in the same folder. (There are many command line switches to choose the name of the output file, combine files into a single PDF and so on, but they’re not relevant here. Windows fans can also use a GUI.)

Using Adobe Acrobat Pro, I can inspect the tag structure of the PDF produced:

Acrobat screenshot: no tags available

As you can see, Acrobat reports “No Tags available”. This is because it’s perfectly legitimate to make inaccessible PDFs – documents intended only for printing, for example. So let’s tell Prince to make a tagged PDF:

$ prince prince1.html --tagged-pdf

Inspecting this file in Acrobat shows the tag structure:

Acrobat screenshot showing tags

Now we can see that under the <Document> tag (PDF’s equivalent of a <body> element), we have an <H1> and a <P>. Yes, PDF tags often —but not always— have the same name as their HTML counterparts. As Adobe says

PDF tags are similar to tags used in HTML to make Web pages more accessible. The World Wide Web Consortium (W3C) did pioneering work with HTML tags to incorporate the document structure that was needed for accessibility as the HTML standard evolved.

However, the fact that the PDF now has structural tags doesn’t mean it’s accessible. Let’s try making a PDF with the PDF/UA-1 profile:

$ prince prince1.html --pdf-profile="PDF/UA-1"

Prince aborts, giving the error “prince: error: PDF/UA-1 requires language specification”. This is because our HTML page is missing the lang attribute on the HTML element, which tells assistive technologies which language the text is written in. This is very important to screen reader users, for example; the pronunciation of the word “six” is very different in English and French.

Unfortunately, this is a very common error on the Web; WebAIM recently analysed the accessibility of the top 1,000,000 home pages and discovered that a whopping 97.8% of home pages had detectable accessibility failures. A missing language specification was the fifth most common error, affecting 33% of sites.

screenshot from webaim showing most common accessibility errors on top million homepages
Image courtesy of webaim.org, © WebAIM, used by kind permission

Let’s fix our web page by amending the HTML element to read <html lang=en>.

Now it princifies without errors. Inspecting it in Acrobat Pro, we see a new <Annot> tag has appeared. Right-clicking on it in the tag inspector reveals it to be the small Prince logo image (that all free licenses generate), with alternate text “This document was created with Prince, a great way of getting web content onto paper”:

Acrobat screenshot with annotation on the Prince logo added with free licenses

Because Prince generates the <Annot> with alternate text and checks that the document’s language is specified, we can produce a fully accessible PDF, which is why we generally advise using the --pdf-profile="PDF/UA-1" command line switch rather than --tagged-pdf.

Adobe maintains a list of Standard PDF tags, most of which can easily be mapped by Prince to HTML counterparts.

Customising Prince’s default mappings

Prince can’t always map HTML directly to PDF tags. This could be because there isn’t a direct counterpart in HTML, or it could be because the source markup has conflicting markup and styling.

Let’s look at the first scenario. HTML has a <main> element, which doesn’t have a one-to-one correspondence with a single PDF tag. On many sites, there is one article per document (a Wikipedia entry, for example), and it’s wrapped by a <main> element, or some other element serving to wrap the main content.

Let’s look at the Wikipedia article for stegosaurus, because it is the best dinosaur.

We can see from browser developer tools that this article’s content is wrapped with <div id="bodyContent">. We can tell Prince to map this to the PDF <Art> tag, defined as “Article element. A self-contained body of text considered to be a single narrative”, by adding a declaration in our stylesheet:

#bodyContent { prince-pdf-tag-type: Art; }

On another site, we might want to map the <main> element to <Art>. The same method applies:

main { prince-pdf-tag-type: Art; }

Different authors’ conventions over the years are one reason why Prince can’t necessarily map everything automatically (although, by default, HTML <article> gets mapped to <Art>).

Therefore, in this new build of PrinceXML, much of the mapping of HTML elements to PDF tags has been moved out of the logic of Prince and into the default stylesheet html.css in the style sub-folder. This makes it clearer how Prince maps HTML elements to PDF tags, and allows the author to override or customise it if necessary.

Here is the relevant section of the default mappings:

article { prince-pdf-tag-type: Art }
section { prince-pdf-tag-type: Sect }
blockquote { prince-pdf-tag-type: BlockQuote }
h1 { prince-pdf-tag-type: H1 }
h2 { prince-pdf-tag-type: H2 }
h3 { prince-pdf-tag-type: H3 }
h4 { prince-pdf-tag-type: H4 }
h5 { prince-pdf-tag-type: H5 }
h6 { prince-pdf-tag-type: H6 }
ol { prince-pdf-tag-type: OL }
ul { prince-pdf-tag-type: UL }
li { prince-pdf-tag-type: LI }
dl { prince-pdf-tag-type: DL }
dl > div { prince-pdf-tag-type: DL-Div }
dt { prince-pdf-tag-type: DT }
dd { prince-pdf-tag-type: DD }
figure { prince-pdf-tag-type: Div } /* figure grouper */
figcaption { prince-pdf-tag-type: Caption }
p { prince-pdf-tag-type: P }
q { prince-pdf-tag-type: Quote }
code { prince-pdf-tag-type: Code }
img, input[type="image"] {
    prince-pdf-tag-type: Figure;
    prince-alt-text: attr(alt);
}
abbr, acronym {
    prince-expansion-text: attr(title)
}
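
Because these are ordinary CSS rules, a stylesheet loaded after html.css can override any of them. The following is only a sketch under assumptions: the site it imagines and the class name intro are made up, and the tag values are ones that already appear in the default mappings.

```css
/* Hypothetical site: <section> is used as a generic styling wrapper,
   so tag it as a plain Div instead of the default Sect. */
section { prince-pdf-tag-type: Div; }

/* Hypothetical class name: a <div class="intro"> that is really
   a paragraph of text should be tagged as one. */
div.intro { prince-pdf-tag-type: P; }
```

Keeping overrides like these in your own stylesheet, rather than editing html.css in place, should make them easier to carry over when Prince is upgraded.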

There are also two new properties, prince-alt-text and prince-expansion-text, which can be overridden to support the relevant ARIA attributes.
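
As a sketch of what such an override might look like, the rules below take alternate text from aria-label where an author has supplied it. They assume the page uses aria-label and role="img" in this way; they are an illustration, not part of Prince’s defaults.

```css
/* Assumption: images are labelled with aria-label as well as, or instead of, alt.
   This selector is more specific than the default img rule, so it wins. */
img[aria-label] { prince-alt-text: attr(aria-label); }

/* Assumption: non-<img> elements (an icon in a <span>, say) are exposed
   as images with role="img" and named with aria-label. */
[role="img"][aria-label] {
    prince-pdf-tag-type: Figure;
    prince-alt-text: attr(aria-label);
}
```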

Uncle Hakon shouting at me in Paris
Uncle Håkon shouting at me last month in Paris

Taking our lead from Wikipedia again, we might want to produce a PDF table of contents from the ‘Contents’ box. Here is the Contents for the entry about otters (which are the best non-dinosaurs):

screenshot of wikipedia's in-page table of contents

The box is an unordered list wrapped in a <div id="toc">. To make this into a PDF Table of Contents (<TOC>), I add these lines to Prince’s html.css (because obviously I can’t touch the Wikipedia source files):

#toc ul {prince-pdf-tag-type: TOC;} /*Table of Contents */
#toc li {prince-pdf-tag-type: TOCI;} /* TOC item */

This produces the following tag structure:

Acrobat screenshot showing PDF table of contents based on the wikipedia table of contents

On one of my personal sites, I use HTML <nav> as the wrapper for my internal navigation, so I would use these declarations instead:

nav ul {prince-pdf-tag-type: TOC;}
nav li {prince-pdf-tag-type: TOCI;}

Only internal links are appropriate for a PDF Table of Contents, which is why Prince can’t automatically map <nav> to <TOC> but makes it easy for you to do so, either by editing html.css directly, or by pulling in a supplementary stylesheet.

Mapping when semantics and styling conflict

There are a number of tricky questions when it comes to tagging when markup and style conflict. For example, consider this markup which is used to “fake” a bulleted list visually:


<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>My lovely PDF</title>
<style>
div div {display:list-item;
    list-style-type: disc;
    list-style-position: inside;}
</style>
<div>

    <div>One</div>
    <div>Two</div>
    <div>Three</div>

</div>

Browsers render it something like this:

what looks like a bulleted list in a browser

But this merely looks like a bulleted list; it isn’t structurally anything other than three meaningless <div>s. If you need this to be tagged in the output PDF as a list (so a screen reader user can use a keyboard shortcut to jump from list to list, for example), you can use these lines of CSS:

body>div {prince-pdf-tag-type: UL;}
div div {prince-pdf-tag-type: LI;}

Prince creates custom OL-L and UL-L tags which are role-mapped to PDF’s list structure tag <L>. Prince also sets the ListNumbering attribute when it can infer it.

Mapping ARIA roles

Often, developers supplement their HTML with ARIA roles. This can be particularly useful when retrofitting legacy markup to be accessible, especially when that markup contains few semantic elements — the usual example is adding role=button to a set of nested <div>s that are styled to look like a button.

Prince does not do anything special with ARIA roles, partly because, as WebAIM reports,

they are often used to override correct HTML semantics and thus present incorrect information or interactions to screen reader users

But by supplementing Prince’s html.css, an author can map elements with specific ARIA roles to PDF tags. For example, if your webpage has many <div role="article"> you can map these to PDF <Art> tags thus:

div[role="article"] {prince-pdf-tag-type: Art;}
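
The same attribute-selector pattern extends to other roles. The rules below are a sketch, assuming legacy markup that builds headings and lists from generic elements with ARIA roles; they are not something Prince ships with.

```css
/* Headings built from generic elements: role="heading" plus aria-level. */
[role="heading"][aria-level="1"] { prince-pdf-tag-type: H1; }
[role="heading"][aria-level="2"] { prince-pdf-tag-type: H2; }
[role="heading"][aria-level="3"] { prince-pdf-tag-type: H3; }

/* Lists built from generic elements. UL assumes the list is unordered. */
[role="list"] { prince-pdf-tag-type: UL; }
[role="listitem"] { prince-pdf-tag-type: LI; }
```

Given WebAIM’s warning above, it is worth checking that the roles in your source are used correctly before mapping them, since Prince will tag whatever these selectors match.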

Conclusion

As with HTML, the more structured and semantic the markup is, the better the output will be. But of course, Prince cannot verify that alternate text is an accurate description of the function of an image. Ultimately, claiming that a document meets the PDF/UA-1 profile requires some human review, so Prince has to trust that the author has done their part in making the input intelligible. Using Prince, it’s very easy to turn long documents, even whole books, into accessible and attractive PDFs.

Reading List 233

A short note on HTML5 article, section and hgroup

This conference season I’ve spoken at some events for non-frontenders, suggesting that people invest time in learning the semantics of HTML. After all, there are only 120(ish) elements; the average two year old knows 100 words and by the time a child is three will have a vocabulary of over 300 words.

A few people asked me the difference between <article> and <section>. My reply: don’t worry. Simply don’t use <section>. Its only use is in the HTML Document Outline Algorithm, which isn’t implemented anywhere, and seemingly never will be. For the same reason, don’t worry about the <hgroup> element.

But do use <article>, and not just for blog posts and news stories. It’s not just for articles in the news sense; it’s for discrete self-contained things. Think “article of clothing”, not “magazine article”. So a list of videos should have each one (and its description) wrapped in an <article>. A list of products, similarly. Consider adding microdata from schema.org, as that will give you better search engine results and look better on Apple Watches.

And, of course, do use <main>, <nav>, <header> and <footer>. They’re really useful for screen reader users – see my article The practical value of semantic HTML.

Happy marking up!

Why would a screen reader user have a braille display?

Last week, I was invited to address the annual conference of the UK Association for Accessible Formats. I found myself sitting next to a man with these two refreshable braille displays, so I asked him what the difference is.

Two similar refreshable braille displays, side by side

On the left is his old VarioUltra 20, which can connect to devices via USB, Bluetooth, and can take a 32MS SD card, for offline use (reading a book, for example). It’s also a note-taker. He told me it cost around £2500. On the right is his new Orbit Reader 20, “the world’s most affordable Refreshable Braille Display” with similar functionality, which costs £500.

As he wasn’t deaf-blind, I asked why he uses such expensive equipment, when devices have built-in free screen readers. One of his reasons was, in retrospect, so blazingly obvious, and so human.

He likes to read his kids bedtime stories. With the braille display, he can read without a synthesised voice in his ear. Therefore, he could do all the characters’ voices himself to entertain his children.

My take-home from this: Of course free screen readers are an enormous boon, but each person has their own reasons for choosing their assistive technologies. Accessibility isn’t a technological problem to be solved. It’s an essential part of the human condition: we all have different needs and abilities.

Reading List 232

Cloning a live WordPress site to work on locally

I’m making some changes to this WordPress site and wanted to get out of the loop of FTPing a new version of my CSS to the live server and refreshing the browser. Rather than clone the site and set up a dev server, I wanted to host it on my local machine so the cycle of changing and testing would be faster and I could work offline.

Nice people on Twitter recommended Local by Flywheel, which was easy to install and get going (no dependency rabbit hole) and which allows you to locally host multiple sites. It also has a really intuitive UI.

To clone my live site, I installed the BackUpWordPress plugin, told it to back up the MySQL database and the files (e.g. the theme, plugins etc.) and let it run. It exports a file that Local by Flywheel can easily ingest – simply drag and drop it onto Local’s start screen. (There’s a handy video that shows how to do it.)

For some reason, although I use the excellent Make Paths Relative plugin, the link to my main stylesheet uses the absolute path, so I edited my local header.php (in Users ▸ brucelawson ▸ Local Sites ▸ brucelawsoncouk1558709320complete201905241-clone ▸ app ▸ public ▸ wp-content ▸ themes ▸ HTML5) to point to the local copy of the CSS:

<link rel="stylesheet" href="http://brucelawson.local/wp-content/themes/HTML5/style.css" media="screen">

And that’s it – fire up Local, start the server, get coding!

If you’re having problems with the local wp-admin redirecting to your live site’s admin page, Flywheel engineers suggest:

  1. Go to the site in Local
  2. Go to Database » Adminer
  3. Go to the wp_XXXXX_options table (click the select button beside it in the sidebar)
  4. Make sure both the siteurl and home options are set to the appropriate local domain. If not, use the edit links to change them.

Reading List 231