Web Data Commons - RDFa, Microdata, and Microformats Data Sets - December 2014

This document provides statistics about the Web Data Commons RDFa, Microdata, and Microformats data sets, which were extracted from the December 2014 release of the Common Crawl.

In summary, we found structured data within 620 million HTML pages out of the 2.01 billion pages contained in the crawl (31%). These pages originate from 2.72 million different pay-level-domains out of the 15.67 million pay-level-domains covered by the crawl (17%). Altogether, the extracted data sets consist of 20.48 billion RDF quads.

An analysis of trends in the adoption of the different markup formats as well as trends in the adoption of selected RDFa and Schema.org classes is presented on the WDC RDFa, Microdata, and Microformats Data Sets Series page.

Instructions on how to download the RDFa, Microdata, and Microformats data sets are given on the "How to get the data" page.

Please note that, in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.
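To make the counting rule concrete, here is a minimal sketch of how a host name could be reduced to its pay-level-domain. It assumes the common reading that a pay-level-domain is a public suffix plus one label to its left; the page itself does not spell this out. The function name `pay_level_domain` and the tiny `PUBLIC_SUFFIXES` set are hypothetical and only for illustration; a real implementation would consult a complete suffix list.

```python
# Hypothetical, deliberately tiny set of public suffixes (illustration only).
PUBLIC_SUFFIXES = {"com", "org", "ca", "co.uk", "com.au"}


def pay_level_domain(host: str) -> str:
    """Return the public suffix plus one label to its left."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before a bare "uk" would be.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                # The host is itself a public suffix; nothing to strip.
                return ".".join(labels)
            return ".".join(labels[i - 1:])
    # Unknown suffix: fall back to the last two labels (an assumption).
    return ".".join(labels[-2:])


for host in ["myblog.blogspot.com", "www.ebay.co.uk", "ebay.com.au"]:
    print(host, "->", pay_level_domain(host))
```

Under this sketch, `myblog.blogspot.com` is counted under `blogspot.com`, while `ebay.co.uk` and `ebay.com.au` remain separate domains, which is consistent with how they appear in the top-domain lists on this page.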

Overall


Crawl Date: Winter 2014
Total Data: 160 Terabyte (compressed)
Parsed HTML URLs: 2,014,175,679
URLs with Triples: 620,151,400
Domains in Crawl: 15,668,667
Domains with Triples: 2,722,425
Typed Entities: 5,516,068,263
Triples: 20,484,755,485
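The percentages quoted in the summary can be recomputed from the counts above. The following is a small sketch of that arithmetic; the variable and function names are chosen here for illustration and are not part of the data sets.

```python
# Counts copied from the "Overall" figures above.
parsed_urls = 2_014_175_679
urls_with_triples = 620_151_400
domains_in_crawl = 15_668_667
domains_with_triples = 2_722_425


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with triples:    {url_share:.1f}% of parsed URLs")
print(f"Domains with triples: {domain_share:.1f}% of domains in the crawl")
```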

Results per Format


Format                                 Domains         URLs  Typed Entities        Triples
html-rdfa                              571,581  257,251,367     405,541,283  2,566,827,347
html-microdata                         819,990  292,601,824   2,209,497,281  9,438,536,906
html-mf-geo                             20,261    4,619,664       7,712,041     20,348,760
html-mf-hcalendar                       24,208    3,496,061      34,595,069    169,557,078
html-mf-hcard                        1,095,517  101,606,009   1,349,620,300  3,850,290,103
html-mf-hcard including html-mf-adr  1,269,607  131,027,251   1,430,171,657  4,104,910,682
html-mf-hlisting                         3,167      202,889       4,473,631     18,838,183
html-mf-hrecipe                          3,476      630,402       5,781,217     24,756,234
html-mf-hresume                            155       16,343          82,751        462,002
html-mf-hreview                         13,772    2,496,303      16,186,868     69,802,632
html-mf-species                             96       31,444         218,463        653,111
html-mf-xfn                            170,202   17,032,646      52,187,702    219,772,447
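As a consistency check, the per-format rows can be compared with the overall figures. The sketch below sums the Typed Entities and Triples columns over all twelve rows and compares them with the overall totals; it also shows that the URL column adds up to more than the overall number of URLs with triples, presumably because a single page can carry several formats. The names `ROWS`, `TOTAL_*`, and `*_sum` are introduced here for illustration only.

```python
# Per-format counts copied from the table above:
# (format, URLs, typed entities, triples)
ROWS = [
    ("html-rdfa", 257_251_367, 405_541_283, 2_566_827_347),
    ("html-microdata", 292_601_824, 2_209_497_281, 9_438_536_906),
    ("html-mf-geo", 4_619_664, 7_712_041, 20_348_760),
    ("html-mf-hcalendar", 3_496_061, 34_595_069, 169_557_078),
    ("html-mf-hcard", 101_606_009, 1_349_620_300, 3_850_290_103),
    ("html-mf-hcard including html-mf-adr", 131_027_251, 1_430_171_657, 4_104_910_682),
    ("html-mf-hlisting", 202_889, 4_473_631, 18_838_183),
    ("html-mf-hrecipe", 630_402, 5_781_217, 24_756_234),
    ("html-mf-hresume", 16_343, 82_751, 462_002),
    ("html-mf-hreview", 2_496_303, 16_186_868, 69_802_632),
    ("html-mf-species", 31_444, 218_463, 653_111),
    ("html-mf-xfn", 17_032_646, 52_187_702, 219_772_447),
]

# Overall counts reported in the "Overall" figures.
TOTAL_URLS_WITH_TRIPLES = 620_151_400
TOTAL_TYPED_ENTITIES = 5_516_068_263
TOTAL_TRIPLES = 20_484_755_485

url_sum = sum(row[1] for row in ROWS)
typed_sum = sum(row[2] for row in ROWS)
triple_sum = sum(row[3] for row in ROWS)

print("typed entities match overall total:", typed_sum == TOTAL_TYPED_ENTITIES)
print("triples match overall total:", triple_sum == TOTAL_TRIPLES)
print("URL column exceeds overall URL count:", url_sum > TOTAL_URLS_WITH_TRIPLES)
```

With the figures as listed, the Typed Entities and Triples columns add up exactly to the overall totals, so the overall numbers appear to be plain sums over all rows of the table.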

Top Domains by Extracted Triples


  1. blogspot.com (633,878,938 triples)
  2. fotolia.com (414,922,908 triples)
  3. tripadvisor.com (301,013,636 triples)
  4. crateandbarrel.com (270,245,981 triples)
  5. aliexpress.com (222,417,968 triples)
  6. flightaware.com (216,257,668 triples)
  7. competitivecyclist.com (190,007,206 triples)
  8. snagajob.com (180,572,895 triples)
  9. coupons.com (176,389,752 triples)
  10. ebay.com.au (174,017,508 triples)
  11. repairpal.com (173,953,875 triples)
  12. bentgate.com (172,775,199 triples)
  13. meetup.com (159,046,792 triples)
  14. ebay.co.uk (150,555,502 triples)
  15. ebay.com (148,915,359 triples)
  16. backcountry.com (141,837,844 triples)
  17. dreamstime.com (121,683,540 triples)
  18. ebay.ca (116,875,013 triples)
  19. wordpress.com (115,893,246 triples)
  20. indeed.com (113,539,582 triples)

Top Domains by URLs with Triples


  1. blogspot.com (18,958,728 urls)
  2. stackexchange.com (6,639,125 urls)
  3. oclc.org (3,895,587 urls)
  4. tripadvisor.com (3,873,106 urls)
  5. wikipedia.org (2,831,109 urls)
  6. go.com (2,696,399 urls)
  7. wordpress.com (2,337,349 urls)
  8. atgstores.com (2,089,670 urls)
  9. mlb.com (1,796,733 urls)
  10. epicsports.com (1,710,978 urls)
  11. cnet.com (1,560,693 urls)
  12. wsj.com (1,554,632 urls)
  13. google.com (1,482,164 urls)
  14. agoda.com (1,476,946 urls)
  15. dreamstime.com (1,473,302 urls)
  16. hotels.com (1,386,784 urls)
  17. stackoverflow.com (1,267,583 urls)
  18. grouprecipes.com (1,196,421 urls)
  19. theguardian.com (1,112,756 urls)
  20. packersproshop.com (1,094,342 urls)

Extractor html-rdfa


Triples Extracted: 2,566,827,347
URLs with Triples: 257,251,367
Average Triples per URL: 9.9779
Domains with Triples: 571,581
Average Triples per Domain: 4,490.7499
Typed Entities: 405,541,283
Detailed Statistics as Excel file: html-rdfa.xlsx (127 KB)
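The two averages reported for each extractor appear to be simple ratios of the counts listed with them. A quick check for html-rdfa under that assumption (variable names are illustrative):

```python
# Counts for the html-rdfa extractor, copied from the figures above.
triples = 2_566_827_347
urls_with_triples = 257_251_367
domains_with_triples = 571_581

# Assumed definitions: total triples divided by the number of
# URLs (or domains) that yielded at least one triple.
avg_per_url = round(triples / urls_with_triples, 4)
avg_per_domain = round(triples / domains_with_triples, 4)

print("Average Triples per URL:   ", avg_per_url)
print("Average Triples per Domain:", avg_per_domain)
```

The same calculation can be repeated for any other extractor by substituting its three counts.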

Extractor html-microdata


Triples Extracted: 9,438,536,906
URLs with Triples: 292,601,824
Average Triples per URL: 32.2573
Domains with Triples: 819,990
Average Triples per Domain: 11,510.5512
Typed Entities: 2,209,497,281
Detailed Statistics as Excel file: html-microdata.xlsx (221 KB)

Extractor html-mf-geo


Triples Extracted: 20,348,760
URLs with Triples: 4,619,664
Average Triples per URL: 4.4048
Domains with Triples: 20,261
Average Triples per Domain: 1,004.3315
Typed Entities: 7,712,041

Extractor html-mf-hcalendar


Triples Extracted: 169,557,078
URLs with Triples: 3,496,061
Average Triples per URL: 48.4995
Domains with Triples: 24,208
Average Triples per Domain: 7,004.1754
Typed Entities: 34,595,069

Extractor html-mf-hcard


Triples Extracted: 3,850,290,103
URLs with Triples: 101,606,009
Average Triples per URL: 37.8943
Domains with Triples: 1,095,517
Average Triples per Domain: 3,514.5873
Typed Entities: 1,349,620,300

Extractor html-mf-hlisting


Triples Extracted: 18,838,183
URLs with Triples: 202,889
Average Triples per URL: 92.8497
Domains with Triples: 3,167
Average Triples per Domain: 5,948.2738
Typed Entities: 4,473,631

Extractor html-mf-hrecipe


Triples Extracted: 24,756,234
URLs with Triples: 630,402
Average Triples per URL: 39.2706
Domains with Triples: 3,476
Average Triples per Domain: 7,122.0466
Typed Entities: 5,781,217

Extractor html-mf-hresume


Triples Extracted: 462,002
URLs with Triples: 16,343
Average Triples per URL: 28.2691
Domains with Triples: 155
Average Triples per Domain: 2,980.6581
Typed Entities: 82,751

Extractor html-mf-hreview


Triples Extracted: 69,802,632
URLs with Triples: 2,496,303
Average Triples per URL: 27.9624
Domains with Triples: 13,772
Average Triples per Domain: 5,068.4455
Typed Entities: 16,186,868

Extractor html-mf-species


Triples Extracted: 653,111
URLs with Triples: 31,444
Average Triples per URL: 20.7706
Domains with Triples: 96
Average Triples per Domain: 6,803.2396
Typed Entities: 218,463

Extractor html-mf-xfn


Triples Extracted: 219,772,447
URLs with Triples: 17,032,646
Average Triples per URL: 12.9030
Domains with Triples: 170,202
Average Triples per Domain: 1,291.2448
Typed Entities: 52,187,702