Web Data Commons Analysis Result
Hannes Mühleisen, Web-Based Systems Group

Motivation

In the Web Data Commons project, we have collected structured data embedded in Web pages using a variety of formats and converted it to RDF. In particular, we have considered Microformats, RDFa and Microdata annotations. Microformats use a set of well-known HTML constructs to add semantics to HTML elements. However, they are limited in their expressivity, as only a limited number of formats for well-defined use cases exist. RDFa and Microdata, on the other hand, can use arbitrary vocabularies and, together with their flexible data formats, are therefore able to express arbitrary data.

In this analysis, we tried to answer the question of how RDFa and Microdata annotations are used, in particular, what kind of data is annotated using which vocabularies. However, since many vocabularies cover a wide range of different things, ranging from barber shops to fictional characters, we were also interested in the parts of vocabularies actually used. Therefore, we analyze usage on three levels: vocabulary, class and property.

In our previous analysis, we used the number of entities (identifiable subsets of data grouped by a common RDF subject) as the basic metric to track the popularity of classes and properties. However, we have found that this metric is easily skewed: if a single site publishes massive numbers of entities of a single type, this dominates our results, although adoption by a single organization does in most cases not indicate increased popularity of a format. Counting the number of URLs on which a vocabulary, class or property appears faces similar problems, in particular when we consider our base data, which covers some websites more thoroughly than others. We have therefore adopted the notion of the Pay-Level Domain (PLD) as an additional dimension for our metrics. The PLD is a sub-domain of a public top-level domain, for which users usually pay. PLDs allow us to identify a realm in which a single user or organization is likely to be in control. For example, the PLD for www.example.com would be example.com. If we count the occurrence of vocabularies, classes and properties only once per PLD, these figures are likely to be more informative about the actual popularity of the annotation vocabularies.
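The difference between counting per entity and counting once per PLD can be illustrated with a small sketch. It is not the code used for this analysis: the function names are hypothetical, the sample observations are invented, and the four-entry `PUBLIC_SUFFIXES` set is only a stand-in for the full Public Suffix List that a real analysis would need.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: a tiny stand-in for the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "de", "co.uk"}


def pay_level_domain(url):
    """Return the public suffix plus one label, or None if no known suffix matches."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Try the longest candidate suffix first, so "co.uk" is preferred over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    return None


def count_by_entity_and_pld(observations):
    """observations: iterable of (page_url, vocabulary) pairs, one pair per entity."""
    entities = defaultdict(int)
    plds = defaultdict(set)
    for page_url, vocabulary in observations:
        entities[vocabulary] += 1
        pld = pay_level_domain(page_url)
        if pld is not None:
            plds[vocabulary].add(pld)  # a set counts each PLD only once
    return dict(entities), {v: len(p) for v, p in plds.items()}


# Invented sample: one site publishes 1,000 entities, two other sites one each.
observations = (
    [("http://www.example.com/page%d" % i, "vocabA") for i in range(1000)]
    + [("http://a.example.org/", "vocabB"), ("http://b.example.de/", "vocabB")]
)
entity_counts, pld_counts = count_by_entity_and_pld(observations)
print(entity_counts)  # vocabA dominates when counting entities
print(pld_counts)     # vocabB is ahead when counting once per PLD
```

In this invented sample, the entity metric ranks vocabA far ahead, while the PLD metric shows that it is used by a single site only.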

Hypothesis

Between the publication of our two datasets in 2009/2010 and 2012, there has been a tremendous uptake in publishing structured data on web pages, mainly driven by large Internet companies such as Google and Facebook. These large companies benefit from consuming structured data from web pages, as it enables them to display more, and more meaningful, information to their users. However, neither RDFa nor Microdata is restricted to supporting only these consumers. Rather, in theory every sufficiently large group could publish and use its own vocabulary and thereby exchange information in a structured way using only the already existing web pages. However, the motivations for doing so are as yet unclear. Our hypothesis for vocabulary usage in embedding structured data on web pages was therefore that only data supported by the large players is published in large amounts and on a large number of Internet properties. Conveniently, the two main vocabularies in this area, schema.org and the Open Graph Protocol, recommend (but do not enforce) different encoding formats. We therefore also expect these formats to have a popularity corresponding to that of their respective vocabulary.

Experimental Setup

To parse and analyze all RDFa and Microdata triples from the two Web Data Commons datasets analyzed so far, we have chosen Apache Pig, which supports Big Data analysis by providing a convenient scripting front-end for creating Hadoop Map/Reduce jobs. These jobs were run on an Amazon Elastic MapReduce cluster. From there, we collected and post-processed the results on a local machine. For both datasets, and for RDFa and Microdata separately, we collected the frequencies of vocabularies, classes, properties and property co-occurrence groups, each counted by entities, URLs and PLDs (see Full Results).
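The three analysis levels (vocabulary, class and property) can be sketched for the entity metric as follows. This is plain Python rather than the Pig scripts actually used, and it rests on assumptions that are not taken from this report: the helper names are hypothetical, a term's vocabulary is taken to be everything up to its last '#' or '/', classes are read from rdf:type statements, and rdf:type itself is not counted as a property.

```python
from collections import Counter, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def vocabulary_of(term_url):
    """Assumed heuristic: the namespace is everything up to the last '#' or '/'."""
    cut = max(term_url.rfind("#"), term_url.rfind("/"))
    return term_url[:cut + 1]


def entity_metrics(triples):
    """triples: iterable of (subject, predicate, object) strings from one page."""
    # Group statements by subject; each group is one entity.
    entities = defaultdict(list)
    for s, p, o in triples:
        entities[s].append((p, o))

    vocabularies, classes, properties = Counter(), Counter(), Counter()
    for statements in entities.values():
        seen_classes = {o for p, o in statements if p == RDF_TYPE}
        seen_properties = {p for p, _ in statements if p != RDF_TYPE}
        seen_vocabularies = {vocabulary_of(t) for t in seen_classes | seen_properties}
        # Sets ensure that each term is counted at most once per entity.
        classes.update(seen_classes)
        properties.update(seen_properties)
        vocabularies.update(seen_vocabularies)
    return vocabularies, classes, properties


# Invented sample triples: one typed entity and one untyped entity.
triples = [
    ("_:a", RDF_TYPE, "http://schema.org/Product"),
    ("_:a", "http://schema.org/name", "Widget"),
    ("_:a", "http://schema.org/name", "Gadget"),
    ("_:b", "http://ogp.me/ns#title", "Home"),
]
vocabularies, classes, properties = entity_metrics(triples)
print(vocabularies)
print(classes)
print(properties)
```

The URL and PLD metrics follow the same pattern, with the page URL or its PLD replacing the subject as the unit in which each term is counted once.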

We have filtered the results in two ways: First, we have removed obviously broken vocabulary, class and property URLs (e.g. http:/example.comAsdf). Second, we have removed the long tail from the results. In particular, frequencies below 100 were removed for the PLD metrics, and frequencies below 1,000 were removed for the entity and URL metrics. Furthermore, we have expressed every frequency as a percentage of the total number of entities, URLs or PLDs, respectively.
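The two filters and the percentage calculation can be sketched as below. The thresholds are those stated above; everything else is an assumption for illustration: the function names are hypothetical, the sample frequencies and the total are invented, and the "obviously broken" check (an http or https scheme together with a non-empty host) is only one plausible reading of the criterion actually applied.

```python
from urllib.parse import urlparse

# Thresholds as stated in the text: frequencies below these values are dropped.
MIN_FREQUENCY = {"pld": 100, "entity": 1000, "url": 1000}


def is_well_formed(term_url):
    """Assumed check for 'obviously broken' URLs: scheme and host must be present."""
    parsed = urlparse(term_url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def filter_and_normalise(frequencies, metric, total):
    """Drop broken URLs and the long tail, then add each row's percentage of total."""
    threshold = MIN_FREQUENCY[metric]
    rows = [
        (term, count, 100.0 * count / total)
        for term, count in frequencies.items()
        if is_well_formed(term) and count >= threshold
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented sample frequencies for a PLD metric with 1,000 PLDs in total.
frequencies = {
    "http://a.example.org/vocab#": 400,
    "http://b.example.org/vocab#": 99,   # long tail: below the PLD threshold
    "http:/example.comAsdf": 500,        # broken: no host after the scheme
}
rows = filter_and_normalise(frequencies, "pld", total=1000)
print(rows)
```

Only the first vocabulary survives both filters in this sample, at 40% of the invented total.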

The Apache Pig scripts used to create these results as well as the required Java library can be found in our Subversion Repository.

Results and Discussion

As predicted, using PLDs as a metric greatly increased the clarity of our results. When comparing the vocabulary tables, we can see, for example, that the DBpedia ontology was found on 50,475 URLs with 507,553 entities, but does not appear in the PLD table for the 2012-rdfa dataset. This is because DBpedia URLs were rather frequent in the Web crawl from which we collected the structured data. Nonetheless, this vocabulary is not widely used, which clearly shows in the PLD table. The remainder of this section will therefore focus on the PLD tables. We discuss the data sets in turn, omitting the 2010 microdata results due to their small size.

For the 2010-rdfa data set, we have found a rather diverse set of vocabularies being used, even though Facebook's Open Graph Protocol (OGP) is already strong, being used on ca. 40% of PLDs. Also strongly represented were the Dublin Core and Creative Commons vocabularies. The distribution of classes is rather even, but a tendency towards product data in Google's schema.org predecessor data-vocabulary.org is visible. However, the property distribution is another matter altogether. Since many entities do not carry a type (most notably those created from Facebook's OGP templates), the property distribution is also very important for rating a vocabulary's popularity. In this case, the property popularity figures confirm our observation from the vocabulary distribution. The property co-occurrence groups by PLD contain only groups of OGP properties, which is very likely due to people copy-pasting templates from the OGP documentation into their web pages.
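How property co-occurrence groups might be counted per PLD is sketched below. The report does not define a co-occurrence group precisely, so this sketch assumes that a group is the complete set of distinct properties used on one entity, and that a group is counted once for every PLD on which it appears. The function name, the abbreviated property names and the sample data are invented.

```python
from collections import defaultdict


def cooccurrence_groups_by_pld(entities):
    """entities: iterable of (pld, properties) pairs, one pair per entity."""
    plds_per_group = defaultdict(set)
    for pld, properties in entities:
        group = frozenset(properties)  # order and repetition do not matter
        if len(group) > 1:  # a single property does not co-occur with anything
            plds_per_group[group].add(pld)
    counts = {group: len(plds) for group, plds in plds_per_group.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Invented sample: the same template-like group appears on two PLDs.
entities = [
    ("example.com", ["og:title", "og:type", "og:url"]),
    ("example.com", ["og:title", "og:type", "og:url"]),
    ("example.org", ["og:url", "og:title", "og:type"]),
    ("example.de", ["dc:title", "dc:creator"]),
    ("example.de", ["dc:title"]),
]
groups = cooccurrence_groups_by_pld(entities)
for group, pld_count in groups:
    print(sorted(group), pld_count)
```

Under this reading, a template copied unchanged onto many sites shows up as one group with a high PLD count, which matches the copy-paste explanation given above.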

If we now compare these results to the 2012-rdfa data set, we should be able to view the development of data annotated using these formats from the vocabulary usage viewpoint: On the vocabulary popularity table, we can see rather unchanged popularity for OGP. Furthermore, the Friend-of-a-Friend vocabulary enjoys an unexpected increase in popularity of ca. 10%. The same development is visible to a lesser degree for the SIOC and RDF Site Summary vocabularies. This development is also reflected in the classes and properties tables, with the already known difference in class usage between OGP and the other vocabularies. While more co-occurrence groups could be identified than in the 2010-rdfa dataset, groups of OGP properties still by far outweigh those of other vocabularies.

The Microdata annotation format was popularized through the schema.org initiative driven by Bing, Google, Yahoo and Yandex. It is therefore no surprise that the vocabulary popularity table for the 2012-microdata data set shows only this vocabulary and its predecessor to be relevant at all. The classes table now shows us the popular parts of these vast vocabularies: Addresses, page navigation information in the form of "breadcrumbs", products, offers and business information occurred on more than 10% of the PLDs using Microdata annotations. This view is confirmed by the properties table and the co-occurrence groups, where properties and groups for these popular classes are also popular.

Conclusion

From these results, we were able to fully confirm our hypothesis: The recent uptake in embedding structured data into HTML pages is almost single-handedly driven by the consumption of this data by large Internet companies. Furthermore, Webmasters seem to strongly respect the recommendations of these data consumers, even down to the encoding format. This movement can be characterized as a pull effect: Whatever is consumed is published. Grassroots movements for vocabularies created by communities can be seen only in the RDFa datasets, and only to some degree.

Full Results


                                   2010-rdfa    2010-microdata   2012-rdfa    2012-microdata
Total Entities                     38,667,304   261,397          95,198,271   95,734,992
Total URLs                         11,062,802   54,678           67,091,714   26,675,080
Total PLDs                         62,285       1,217            241,550      44,331
Vocabs/Entities                    HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Vocabs/URLs                        HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Vocabs/PLDs                        HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Classes/Entities                   HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Classes/URLs                       HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Classes/PLDs                       HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Properties/Entities                HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Properties/URLs                    HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Properties/PLDs                    HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Property Co-Occurrence/Entities    HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Property Co-Occurrence/URLs        HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV
Property Co-Occurrence/PLDs        HTML/CSV     HTML/CSV         HTML/CSV     HTML/CSV

Related Work

Peter Mika and Tim Potter, "Metadata Statistics for a Large Web Corpus", Linked Data on the Web (LDOW2012). They have also performed an analysis using PLDs as a metric. Their results, particularly Table 7 and Table 9, are consistent with ours, which supports the validity of our methodology and data set.

Acknowledgements

The author would like to thank Andreas Schultz and Anja Jentsch for their help. We would also like to thank Amazon Web Services (AWS) for providing us with a research grant to pay for their services.