Complementary Efforts of the W3C Mirror the Trend
Two and one-half years ago the triumvirate of Google, Bing and Yahoo! (soon to be joined by Yandex, the major Russian search engine) released schema.org. The purpose of schema.org is to give Web site owners and authors a simple means to tag their sites with a vocabulary, designed to be understandable by search engines, that describes common things and activities on the Web. Though informed and led by innovators with impeccable backgrounds in the early semantic Web and knowledge representation [1], the founders of schema.org also understood that the Web is a messy place, often with wrong syntax and usage. Their stated commitment to simplicity and practicality led me to state on the day of release that schema.org was “perhaps the most important event for the structured Web since RDF was released a dozen years ago.”
Just a week ago schema.org version 1.0e was released. That event, plus much else in recent months, suggests real maturity and uptake of schema.org. It looks like the promise of schema.org is being fulfilled.
Growth and Impact of the schema
When first released, schema.org provided nearly 300 structured record types that may be used to tag information in Web pages. Via various collaborative processes since, and with an active discussion group, the schema.org vocabulary has about doubled in size. Some key areas of expansion have been in describing various actions, adding basic medical terms, product and transaction expansion via linkages to GoodRelations, civic services, and most recently, accessibility. Many other additions are in progress.
In his keynote address at ISWC 2013 in Sydney on October 23, Ramanathan Guha [1] reported that 15 percent of crawled pages and 5 million sites have some schema.org markup. We can also see that some of the most widely used content management systems on the Web, notably including WordPress, Joomla and Drupal, have or plan to have native schema.org support. These tooling trends are important because, though schema.org was designed for simple manual markup, it does require a bit of attention and skill to get the markup right. Having markup added to pages automatically in the background is the next threshold for even broader adoption.
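To make the idea of background markup concrete, here is a minimal sketch of what a content management system template might do when it renders an article. It is only an illustration under stated assumptions: the function names `article_jsonld` and `as_script_tag` are hypothetical, the sample values are made up, and JSON-LD is chosen as the serialization for brevity (microdata or RDFa attributes in the HTML would serve the same purpose). The types and properties used (`NewsArticle`, `Person`, `headline`, `datePublished`, `author`, `name`) are schema.org terms.

```python
import json


def article_jsonld(headline: str, author: str, date_published: str) -> dict:
    """Build a schema.org NewsArticle description as a plain dict.

    Hypothetical helper: a real CMS would fill these fields from its
    own content model.
    """
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "datePublished": date_published,  # ISO 8601 date string
        "author": {"@type": "Person", "name": author},
    }


def as_script_tag(data: dict) -> str:
    """Wrap the JSON-LD in the script element a page template would emit."""
    body = json.dumps(data, indent=2)
    # Escape "</" so a value containing "</script>" cannot close the tag early;
    # "\/" is a valid JSON escape for "/".
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Sample values are invented for illustration.
markup = as_script_tag(
    article_jsonld("Schema.org matures", "A. Writer", "2013-11-01")
)
print(markup)
```

The author never touches the markup here; the template derives it from fields the system already stores, which is the kind of automation that removes the need for manual attention and skill.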
The ability of the schema.org vocabulary to capture essential domain facts as structured data is reflected in the growing list of prominent sites tagging with schema.org. According to Guha, these are some of the prominent sites now using schema.org:
Category | Prominent Sites |
News | Nytimes, guardian.com, bbc.co.uk |
Movies | imdb, rottentomatoes, movies.com |
Jobs / careers | careerjet.com, monster.com, indeed.com |
People | linkedin.com |
Products | ebay.com, alibaba.com, sears.com, cafepress.com, sulit.com, fotolia.com |
Videos | youtube, dailymotion, frequency.com, vinebox.com |
Medical | cvs.com, drugs.com |
Local | yelp.com, allmenus.com, urbanspoon.com |
Events | wherevent.com, meetup.com, zillow.com, eventful |
Music | last.fm, myspace.com, soundcloud.com |
Key Applications | pinterest.com, opentable.com |
Examples like Pinterest show how schema.org can also provide a central organizing point for new ventures and applications. There are also key relationships between schema.org and new search initiatives such as Google Now or Google’s Knowledge Graph.
From day one schema.org was released with a mechanism for other parties to extend its vocabulary. More recently, however, there has been a significant increase in attention to questions of interoperability and the relation to other existing vocabularies. To wit:
- Prominent knowledge representation experts, such as Peter Patel-Schneider, have become active in suggesting better interoperability and design considerations
- The root of schema.org is now recognized as owl:Thing
- Much discussion has occurred on whether or not to integrate or interoperate with SKOS, the Simple Knowledge Organization System
- Provisions have been added to capture concepts such as domain and range
- Calls have been made to increase the number of examples and documentation, including enforcing consistency across the vocabulary.
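The domain and range provisions in the list above can be sketched in a few lines. Schema.org expresses them with the `domainIncludes` and `rangeIncludes` properties, which are deliberately looser than `rdfs:domain` and `rdfs:range`: they list the types a property is expected to be used with rather than licensing inference. The tables below are a small hand-copied excerpt for illustration only, and `ancestors`, `fits` and `check` are hypothetical helper names, not part of any schema.org tooling.

```python
# Hand-copied excerpt of the type hierarchy; the real vocabulary is far larger.
PARENT = {
    "NewsArticle": "Article",
    "Article": "CreativeWork",
    "CreativeWork": "Thing",
    "Person": "Thing",
    "Organization": "Thing",
    "Thing": None,
}

# Excerpt of property definitions (assumption: not the complete lists).
PROPERTIES = {
    "author": {
        "domainIncludes": {"CreativeWork"},
        "rangeIncludes": {"Person", "Organization"},
    },
    "name": {
        "domainIncludes": {"Thing"},
        "rangeIncludes": {"Text"},
    },
}


def ancestors(type_name):
    """Yield type_name and then each supertype up to the root."""
    while type_name is not None:
        yield type_name
        type_name = PARENT.get(type_name)


def fits(type_name, allowed):
    """True if the type or any of its supertypes is in the allowed set."""
    return any(t in allowed for t in ancestors(type_name))


def check(subject_type, prop, value_type):
    """Check one statement against the property's expected domain and range."""
    spec = PROPERTIES[prop]
    return fits(subject_type, spec["domainIncludes"]) and fits(
        value_type, spec["rangeIncludes"]
    )


print(check("NewsArticle", "author", "Person"))  # subject is a CreativeWork subtype
print(check("Person", "author", "Person"))       # Person is not a CreativeWork
```

A markup validator could use a check like this to warn about, rather than reject, unexpected usage, which fits the vocabulary's tolerance for a messy Web.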
To be clear, it was never the intent for schema.org to become a single, governing vocabulary for the Web. Nonetheless, these broader means to enable others to tie in effectively with it are an indicator that schema.org’s sponsors are serious about finding effective common grounds.
Aside from certain areas such as recipes or claiming site or blog ownership, it has been unclear how, or whether, the search engines are actually using schema.org markup. The sponsors have oft stated a go-slow attitude to see whether the marketplace indeed embraces the vocabulary. I’m also sure that the sponsors, as familiar as they are with spam and erroneous markup, have wanted to put in place effective ingest procedures that do not reduce the quality of their search indexes.
Getting Dan Brickley, one of the better-known individuals in RDF and the semantic Web, to act as schema.org’s liaison to the broader community, and beginning to open up about actual usage and uptake of schema.org are great signs of the sponsors’ commitment to the vocabulary. We should expect to see a much quickened pace and more visibility for schema.org within the search services themselves within the coming months.
W3C’s Complementary Efforts
Meanwhile, back at the ranch, a number of other interesting efforts are occurring within the World Wide Web Consortium (W3C) that are complementary to these trends. As readers of this blog well know, I have argued for some time that RDF makes for a fantastic data model for interoperating disparate content, which our company Structured Dynamics centrally relies upon, but that RDF is not essential for metadata specification or exchange. Understood serializations based on understood vocabularies (in other words, exactly the design of schema.org) should be sufficient to describe the various types of things and their attributes as may be found on the Web. This idea of structured data in a variety of forms puts control into the hands of content authors. Various markets will determine what makes best sense for them as to how they actually express that structured data.
Last week the W3C announced its retirement of the Semantic Web group, subsuming it instead into the activities of the new W3C Data Activity. The W3C also announced a new group in CSV (comma-separated values) data exchange to go along with recent efforts in JSON-LD (linked data).
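As a rough illustration of how the same structured data can move between these forms, the sketch below lifts CSV rows into JSON-LD records that use the schema.org vocabulary. It assumes the column headers already match schema.org property names (`name` and `url` here); the sample rows and the `rows_to_jsonld` helper are invented for the example and do not reflect any W3C specification for CSV on the Web.

```python
import csv
import io
import json

# Invented sample data; headers are assumed to be schema.org property names.
CSV_TEXT = """name,url
Example Cafe,http://example.org/cafe
Example Books,http://example.org/books
"""


def rows_to_jsonld(csv_text, type_name):
    """Turn each CSV row into a JSON-LD record typed with a schema.org type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"@context": "https://schema.org", "@type": type_name, **row}
        for row in reader
    ]


records = rows_to_jsonld(CSV_TEXT, "Organization")
print(json.dumps(records, indent=2))
```

The point is less the code than the division of labor: the vocabulary carries the meaning, and the serialization is whatever suits the author or the market at hand.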
These are great trends that reflect a bias toward adoption. Along with the advances taking place with schema.org, the Web now appears to be entering into a golden age of structured data.
Schema.org has been on the uptake in recent times, mostly for authorship, recipes and reviews for display purposes. I think there will be a greater uptake if search engines provide more information about the potential and future uses of schema.org, best practices to prevent spam activities and how this fits in with the Knowledge Graph.