One of my aims with this site is to test out and better understand technologies and techniques, new and old, that help to make websites work better. Currently, one of the most exciting areas of newfangledness is semantic search, Google’s Knowledge Graph and microdata.

René Magritte, The Treachery of Images
René Magritte, taking labelling seriously since 1898

For the uninitiated, microdata is a way to add machine-readable labels to information on a web page. It can help a search engine to understand the meaning of your content by saying, for example, “this is a piece of academic research”, “this is the name of the blog author”, or “this is a product review”.

Each individual label is quite simple but, properly used, they can combine to give a detailed picture of what a piece of web content actually means. That is the basis of semantic search, in which search engines provide more accurate results by better understanding both content and search intent.

We’re already seeing the beginning of this with Google’s Knowledge Graph, but even before this you could type “two times two” into the Google search field and get both the answer to the arithmetical problem, 2×2, and matches for the words. Clever stuff, if you think about it.

Schema.org, a standard vocabulary

Schema.org provides a microdata vocabulary, which means it’s one possible set of “labels” that you can choose from to add a layer of machine-readable data to, well, almost anything.

It’s an important one, too, since it’s developed and supported by Google, Bing, Yahoo and Yandex. Clearly, a standard vocabulary is fundamental to the usefulness of microdata – adding labels to our data would be a nuisance if each search engine had its own, proprietary set of labels – so one that is developed by the world’s dominant search engines is one to watch.

The vocabulary is divided into broad groups and sub-groups of data, each of which contains markers for specific, individual pieces of data, such as an author’s name, the date of publication or a review score. The hierarchy of these groups and individual labels is quite important since the labels need to be used within the scope of their group, so for best results (and tidy code) that means a bit of planning when structuring HTML.
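To make the scoping concrete, here is a minimal sketch of nested microdata. The headline, date and author name are invented placeholders; the types and properties (BlogPosting, Person, headline, datePublished, author, name) come from the Schema.org vocabulary.

```html
<!-- Outer scope: everything inside describes one BlogPosting -->
<article itemscope itemtype="http://schema.org/BlogPosting">
  <h1 itemprop="headline">An example post</h1>
  <time itemprop="datePublished" datetime="2012-08-26">August 26th, 2012</time>
  <!-- Inner scope: "name" here belongs to the Person, not to the post -->
  <p itemprop="author" itemscope itemtype="http://schema.org/Person">
    By <span itemprop="name">Jane Example</span>
  </p>
</article>
```

Because each itemprop attaches to the nearest enclosing itemscope, moving the name outside the inner element would change what it describes, which is why the HTML structure needs a little planning.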

Currently, much focus is on the bits of the vocabulary that do something obvious and visible through Rich Snippets, for example star-rating symbols for search results relating to reviews, or lists of concert event details. Useful and fun as these undeniably are, they only scratch the surface of what is possible.

While we’ve yet to find out how search engines will deal with the rest of the vocabulary – article markup, for example – there are already reports that Bing appears to respond favourably to sites that use it.

Microdata for WordPress templates

Some basic applications for a WordPress blog might be to identify:

  • That a web page is a blog post
  • The author’s name
  • The date the post was published
  • The headline or title of the post
  • The text of the blog post itself

Depending on the topics the blog covers, you might want to go further and add markup for reviews or a specific topic area, such as medical studies or products.

Editing WordPress PHP

Despite a couple of years’ experience in building WordPress templates, my knowledge of PHP had never grown much beyond knowing enough to embed the right code in the right places to make WordPress work. But my desire to build microdata into this site’s template forced me to delve a little deeper, learning more about syntax and behaviour to achieve slightly more complex aims.

As you’d expect, there’s already a healthy crop of free plugins that will do the grunt work for you, but in this case I feel that doing it manually is, for most applications, the preferable solution. I’ve nothing against using a plugin if it works well (for example, I use the Yoast WordPress SEO plugin to take care of a lot of SEO basics like unique meta descriptions and robot control). It’s a practice I like to limit, though, partly to avoid bloat from the additional JavaScript and CSS files most plugins introduce, and partly because I often don’t like the code they produce.

Basic microdata to describe a blog post

For anyone with a grasp of the rudiments of WordPress template building, adding some basic microdata should be easy. For example, a blog post is usually displayed using the template file single.php, so a few simple HTML edits in there will get us quite a long way:

<div id="blog-post" itemscope itemtype="http://schema.org/BlogPosting">
<h2 itemprop="name"><a href="<?php the_permalink(); ?>"><?php the_title(); ?></a></h2>
<?php // The content attribute carries the machine-readable (ISO 8601) date ?>
<h3 itemprop="datePublished" content="<?php the_time('c'); ?>"><?php the_time('F jS, Y'); ?></h3>
<h4 itemprop="author"><?php the_author(); ?></h4>
<div class="entry" itemprop="articleBody">
<?php the_content(); ?>
</div>
</div>

Note that, with the publish date, we need to add a machine-readable form within the “content” attribute. I’ve done this with a simple bit of PHP:

<?php the_time('c'); ?>
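As a rough illustration of what that produces: the_time() accepts PHP date format strings, and 'c' is PHP’s ISO 8601 format. The sketch below uses plain PHP’s DateTimeImmutable rather than WordPress itself, with an arbitrary example timestamp, so it can be run outside a WordPress install.

```php
<?php
// Plain-PHP illustration (no WordPress needed) of the two date formats
// used in the template. The timestamp is an arbitrary example value.
$published = new DateTimeImmutable( '2012-08-26 04:26:32', new DateTimeZone( 'UTC' ) );

// 'c' gives the machine-readable ISO 8601 form for the content attribute.
$machine = $published->format( 'c' );

// 'F jS, Y' gives the human-readable form shown to visitors.
$human = $published->format( 'F jS, Y' );

echo $machine . "\n"; // 2012-08-26T04:26:32+00:00
echo $human . "\n";   // August 26th, 2012
```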

Microdata for WordPress images

Finally, it’d be nice to mark up our images with microdata, too. WordPress generates the code for images automatically, so other than manually altering it for every image posted, the only sensible ways of adding microdata are to edit the core file responsible for it (media.php), or to write a custom filter within the theme’s functions.php file.

The latter lives in the theme folder and will dynamically grab the generated code, alter it, and then render it on the page. The former would be overwritten every time WordPress updates, so the functions file emerges as the only robust solution.

The filter needs to hook into “img_caption_shortcode”, which controls the output of WordPress’s caption shortcode, and in my implementation returns something like this:

return '<figure class="wp-caption ' . esc_attr($align) . '" itemscope itemtype="http://schema.org/ImageObject">' . do_shortcode( $content ) . '<figcaption class="wp-caption-text" itemprop="caption">' . $caption . '</figcaption></figure>';
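For context, a filter along these lines might be registered in functions.php as sketched below. This is an assumption about the overall shape, not the exact code behind the return statement above: the function name example_caption_microdata is hypothetical, and htmlspecialchars() stands in for esc_attr() so that the function can also be run outside WordPress.

```php
<?php
// Hypothetical filter callback for the 'img_caption_shortcode' hook.
// $output is the markup so far (empty by default), $attr holds the
// shortcode attributes, $content is the image (and link) HTML.
function example_caption_microdata( $output, $attr, $content ) {
    $align   = isset( $attr['align'] ) ? $attr['align'] : 'alignnone';
    $caption = isset( $attr['caption'] ) ? $attr['caption'] : '';

    // With no caption, return the input untouched; an empty string
    // lets WordPress fall back to its default caption handling.
    if ( '' === $caption ) {
        return $output;
    }

    // Process nested shortcodes when WordPress is loaded.
    $inner = function_exists( 'do_shortcode' ) ? do_shortcode( $content ) : $content;

    // The caption is passed through as-is, as in the snippet above.
    return '<figure class="wp-caption ' . htmlspecialchars( $align, ENT_QUOTES ) . '" itemscope itemtype="http://schema.org/ImageObject">'
        . $inner
        . '<figcaption class="wp-caption-text" itemprop="caption">' . $caption . '</figcaption></figure>';
}

// Register the callback only when WordPress's hook API is available.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'img_caption_shortcode', 'example_caption_microdata', 10, 3 );
}
```

The final argument to add_filter() tells WordPress to pass all three parameters to the callback; without it, only the first would arrive.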

Microdata for reviews: changing schema by blog category

Having got that far, I wanted to do something a bit more complex. As it is, if I were to publish a review in that template, there’d be no microdata to describe it as a review, and no fancy rich snippets in search results either. So it struck me as a good idea that, if I were to place a post in a category named “Reviews”, the microdata would change accordingly.

For example, the containing structural element now carries the following:

<?php echo in_category( 'Reviews' ) ? 'itemscope itemtype="http://schema.org/Review"' : 'itemscope itemtype="http://schema.org/BlogPosting"'; ?>

This simply checks whether the post is in the category “Reviews” and, if so, applies the appropriate itemtype. If not, it defaults to describing the post as a blog post. Using the same basic principle, it’s easy enough to go through the entire template and make every element respond to the category accordingly.
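If the same check is needed in several places, one option is to move the decision into small helpers in functions.php. The sketch below is only one way of organising it: both function names are hypothetical, and the reviewBody and articleBody properties are taken from the Schema.org definitions of Review and BlogPosting.

```php
<?php
// Hypothetical helper: microdata attributes for the post wrapper element.
function example_post_itemscope( $is_review ) {
    $type = $is_review ? 'Review' : 'BlogPosting';
    return 'itemscope itemtype="http://schema.org/' . $type . '"';
}

// Hypothetical helper: the property name for the main text, which
// differs between the two types in the Schema.org vocabulary.
function example_body_itemprop( $is_review ) {
    return $is_review ? 'reviewBody' : 'articleBody';
}

echo example_post_itemscope( true ) . "\n";  // itemscope itemtype="http://schema.org/Review"
echo example_body_itemprop( false ) . "\n";  // articleBody
```

In the template, the wrapper would then carry <?php echo example_post_itemscope( in_category( 'Reviews' ) ); ?>, and the entry div would use itemprop="<?php echo example_body_itemprop( in_category( 'Reviews' ) ); ?>".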

To actually get the fancy stars in search results, though, a review needs some way of attaching a rating. In keeping with my minimalist tastes, all I wanted was a simple rating in plain text at the end of a review. This is easily done using custom meta fields in WordPress. In my implementation, that requires something like this in the post template:

echo '<p><strong>Overall rating:</strong> <span itemprop="reviewRating">' . esc_html( get_post_meta( get_the_ID(), 'item_rating', true ) ) . '</span>/5</p>';
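One caveat worth sketching: in the Schema.org vocabulary, reviewRating expects a nested Rating item with a ratingValue, not a bare number. A hypothetical helper that builds that nested markup might look like this. The function name and the default scale of 5 are assumptions, and in the template the value would come from the item_rating custom field.

```php
<?php
// Hypothetical helper: builds nested Rating markup for a review score.
// $rating is the stored score; $best is the top of the scale (assumed 5).
function example_rating_markup( $rating, $best = 5 ) {
    // Escape both values, since custom fields can hold arbitrary text.
    $value = htmlspecialchars( (string) $rating, ENT_QUOTES );
    $best  = htmlspecialchars( (string) $best, ENT_QUOTES );

    return '<p><strong>Overall rating:</strong> '
        . '<span itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">'
        . '<span itemprop="ratingValue">' . $value . '</span>/'
        . '<span itemprop="bestRating">' . $best . '</span>'
        . '</span></p>';
}

echo example_rating_markup( '4' ) . "\n";
```

Visitors still see plain text such as “4/5”, while the nested spans give a parser the score and the scale separately.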

Going further

While a lot of the above isn’t yet explicitly supported by Google, Bing, etc. (reviews being an exception), I think it’s a question of when, not if. It’s perhaps inevitable that elements of the specification will change before then, but getting started now means getting a head start on addressing the various technical and practical challenges of actually using the stuff… so, why wait?