Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2024

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets, which were extracted from the October 2024 release of the Common Crawl.

In summary, we found structured data within 1.2 billion HTML pages out of the 2.4 billion pages contained in the crawl (51.25%). These pages originate from 16.5 million different pay-level-domains out of the 37.5 million pay-level-domains covered by the crawl (44.12%). Altogether, the extracted data sets consist of 74 billion RDF quads.

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets are given on the "How to Get the Data" page.

In addition, we have extracted schema.org class-specific datasets from the Microdata and JSON-LD corpora.

Please note that in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.

Overall


Crawl Date October 2024
Total Data 77.33 Terabyte (compressed)
Parsed HTML URLs 2,391,039,772
URLs with Triples 1,245,622,627
Domains in Crawl 37,447,141
Domains with Triples 16,525,070
Typed Entities 15,647,463,083
Triples 73,993,669,093
Size of Extracted Data 1.4 Terabyte (compressed)

Results per Format


Format                    Domains           URLs  Typed Entities         Triples
html-embedded-jsonld   11,562,359    833,818,654   9,689,931,985  47,979,634,597
html-microdata          7,599,792    574,648,578   4,616,719,570  21,825,355,450
html-mf-hcard           3,522,517    179,698,044   1,144,374,197   3,440,560,541
html-rdfa                 474,635     49,636,704     149,228,042     458,151,723
html-mf-xfn               270,304     16,957,879      21,511,452     194,009,215
html-mf-adr                97,811      5,750,460      11,801,204      40,888,969
html-mf-geo                18,109      1,336,547       2,552,039       6,896,307
html-mf-hcalendar          14,076        777,304       4,507,212      18,079,452
html-mf-hreview            11,993        689,081       1,904,373      13,473,054
html-mf-hlisting            5,381         79,436       3,816,635      12,539,856
html-mf-hrecipe             1,805        102,420         626,903       2,755,421
html-mf2-h-adr             16,245        198,665         292,704         847,497
html-mf-hresume                61          1,103           3,023           7,560
html-mf-species               291         75,071         193,744         469,451
overall                16,525,070  1,221,984,070  15,647,463,083  73,993,669,093
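
The per-format rows can be cross-checked against the overall row. The following Python sketch copies the triple counts from the table above, checks that they add up to the overall total, and derives each format's share of all triples. The names `TRIPLES_PER_FORMAT`, `OVERALL_TRIPLES`, and `triple_shares` are illustrative and not part of any Web Data Commons tooling; only the numbers come from the table.

```python
# Triple counts per format, copied from the "Results per Format" table.
TRIPLES_PER_FORMAT = {
    "html-embedded-jsonld": 47_979_634_597,
    "html-microdata": 21_825_355_450,
    "html-mf-hcard": 3_440_560_541,
    "html-rdfa": 458_151_723,
    "html-mf-xfn": 194_009_215,
    "html-mf-adr": 40_888_969,
    "html-mf-geo": 6_896_307,
    "html-mf-hcalendar": 18_079_452,
    "html-mf-hreview": 13_473_054,
    "html-mf-hlisting": 12_539_856,
    "html-mf-hrecipe": 2_755_421,
    "html-mf2-h-adr": 847_497,
    "html-mf-hresume": 7_560,
    "html-mf-species": 469_451,
}

# Value of the "overall" row in the Triples column.
OVERALL_TRIPLES = 73_993_669_093


def triple_shares(counts, total):
    """Return each format's share of all triples, in percent."""
    return {fmt: 100 * n / total for fmt, n in counts.items()}


summed = sum(TRIPLES_PER_FORMAT.values())
shares = triple_shares(TRIPLES_PER_FORMAT, OVERALL_TRIPLES)

print("per-format triples add up to overall:", summed == OVERALL_TRIPLES)
for fmt, share in sorted(shares.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{fmt}: {share:.2f}% of all triples")
```

The same check does not apply to the Domains and URLs columns: their per-format values sum to more than the overall row, presumably because a single page or domain can carry more than one markup format.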



Top Domains by Extracted Triples


  1. blogspot.com (616,572,399 triples)
  2. airbnb.com (220,941,165 triples)
  3. rambler.ru (143,611,594 triples)
  4. pinterest.com (133,683,930 triples)
  5. pinkoi.com (127,030,064 triples)
  6. eslite.com (108,023,469 triples)
  7. semrush.com (96,718,225 triples)
  8. wikipedia.org (81,265,926 triples)
  9. cleanpng.com (73,791,080 triples)
  10. google.com (70,930,906 triples)
  11. made-in-china.com (65,890,676 triples)
  12. clinicsoftware.com (61,850,352 triples)
  13. searchtruth.com (53,155,836 triples)
  14. uol.com.br (52,597,205 triples)
  15. apple.com (47,325,264 triples)
  16. fandom.com (43,707,791 triples)
  17. boohoo.com (43,673,488 triples)
  18. justia.com (41,724,118 triples)
  19. kayak.com (40,406,850 triples)
  20. minnesotamonthly.com (39,294,179 triples)

Top Domains by URLs with Triples


  1. blogspot.com (12,944,926 urls)
  2. wikipedia.org (3,704,547 urls)
  3. pinterest.com (2,518,259 urls)
  4. photoshelter.com (1,624,310 urls)
  5. fandom.com (1,399,738 urls)
  6. aif.ru (961,021 urls)
  7. made-in-china.com (831,723 urls)
  8. wordpress.org (818,294 urls)
  9. myshopify.com (788,427 urls)
  10. altervista.org (705,651 urls)
  11. uol.com.br (676,703 urls)
  12. ox.ac.uk (581,632 urls)
  13. yahoo.com (563,158 urls)
  14. ning.com (541,963 urls)
  15. typepad.com (526,853 urls)
  16. google.com (526,740 urls)
  17. bmj.com (485,893 urls)
  18. airbnb.com (477,698 urls)
  19. hatenablog.com (470,752 urls)
  20. infoisinfo-au.com (430,617 urls)
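
Domains that appear in both rankings can be combined to estimate how many triples a domain yields per page. The sketch below is one possible way to do this in Python, assuming the list entries keep the `domain (count unit)` shape shown above; the regular expression and the helper name `parse_entries` are illustrative, and only a handful of entries are copied in.

```python
import re

# A few entries copied from the two "Top Domains" lists above.
TOP_BY_TRIPLES = """\
blogspot.com (616,572,399 triples)
airbnb.com (220,941,165 triples)
pinterest.com (133,683,930 triples)
wikipedia.org (81,265,926 triples)
"""

TOP_BY_URLS = """\
blogspot.com (12,944,926 urls)
wikipedia.org (3,704,547 urls)
pinterest.com (2,518,259 urls)
airbnb.com (477,698 urls)
"""

# Matches lines such as "blogspot.com (616,572,399 triples)".
ENTRY = re.compile(r"^(\S+) \(([\d,]+) (?:triples|urls)\)$")


def parse_entries(text):
    """Map each domain to its integer count, skipping lines that do not match."""
    result = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            result[match.group(1)] = int(match.group(2).replace(",", ""))
    return result


triples = parse_entries(TOP_BY_TRIPLES)
urls = parse_entries(TOP_BY_URLS)

# Average triples per URL for domains present in both lists.
triples_per_url = {
    domain: triples[domain] / urls[domain]
    for domain in triples.keys() & urls.keys()
}

for domain, avg in sorted(triples_per_url.items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {avg:.2f} triples per URL")
```

Among these four domains, airbnb.com has far more triples per page than blogspot.com, which is why it ranks second by triples but only eighteenth by URLs.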

Extractor html-embedded-jsonld


Triples Extracted 47,979,634,597
URLs with Triples 833,818,654
Average Triples per URL 57.54
Domains with Triples 11,562,359
Average Triples per Domain 4,149.64
Typed Entities 9,689,931,985
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx
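
The two averages reported for each extractor follow directly from the counts in the same block: triples divided by URLs, and triples divided by domains. As a minimal sketch, using the html-embedded-jsonld figures and a hypothetical helper named `average`:

```python
def average(triples: int, units: int) -> float:
    """Average number of triples per unit (URL or domain), rounded to two decimals."""
    return round(triples / units, 2)


# Counts copied from the html-embedded-jsonld block.
JSONLD_TRIPLES = 47_979_634_597
JSONLD_URLS = 833_818_654
JSONLD_DOMAINS = 11_562_359

per_url = average(JSONLD_TRIPLES, JSONLD_URLS)
per_domain = average(JSONLD_TRIPLES, JSONLD_DOMAINS)

print("Average Triples per URL:", per_url)
print("Average Triples per Domain:", per_domain)
```

The same calculation can be applied to the counts of any other extractor section.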

Extractor html-microdata


Triples Extracted 21,825,355,450
URLs with Triples 574,648,578
Average Triples per URL 37.98
Domains with Triples 7,599,792
Average Triples per Domain 2,871.84
Typed Entities 4,616,719,570
Detailed Statistics as Excel-File html-microdata.xlsx

Extractor html-mf-hcard


Triples Extracted 3,440,560,541
URLs with Triples 179,698,044
Average Triples per URL 19.15
Domains with Triples 3,522,517
Average Triples per Domain 976.73
Typed Entities 1,144,374,197

Extractor html-rdfa


Triples Extracted 458,151,723
URLs with Triples 49,636,704
Average Triples per URL 9.23
Domains with Triples 474,635
Average Triples per Domain 965.27
Typed Entities 149,228,042
Detailed Statistics as Excel-File html-rdfa.xlsx

Extractor html-mf-xfn


Triples Extracted 194,009,215
URLs with Triples 16,957,879
Average Triples per URL 11.44
Domains with Triples 270,304
Average Triples per Domain 717.74
Typed Entities 21,511,452

Extractor html-mf-adr


Triples Extracted 40,888,969
URLs with Triples 5,750,460
Average Triples per URL 7.11
Domains with Triples 97,811
Average Triples per Domain 418.04
Typed Entities 11,801,204

Extractor html-mf-geo


Triples Extracted 6,896,307
URLs with Triples 1,336,547
Average Triples per URL 5.16
Domains with Triples 18,109
Average Triples per Domain 380.82
Typed Entities 2,552,039

Extractor html-mf-hcalendar


Triples Extracted 18,079,452
URLs with Triples 777,304
Average Triples per URL 23.26
Domains with Triples 14,076
Average Triples per Domain 1,284.42
Typed Entities 4,507,212

Extractor html-mf-hreview


Triples Extracted 13,473,054
URLs with Triples 689,081
Average Triples per URL 19.55
Domains with Triples 11,993
Average Triples per Domain 1,123.41
Typed Entities 1,904,373

Extractor html-mf-hlisting


Triples Extracted 12,539,856
URLs with Triples 79,436
Average Triples per URL 157.86
Domains with Triples 5,381
Average Triples per Domain 2,330.4
Typed Entities 3,816,635

Extractor html-mf-hrecipe


Triples Extracted 2,755,421
URLs with Triples 102,420
Average Triples per URL 26.9
Domains with Triples 1,805
Average Triples per Domain 1,526.55
Typed Entities 626,903

Extractor html-mf-hresume


Triples Extracted 7,560
URLs with Triples 1,103
Average Triples per URL 6.85
Domains with Triples 61
Average Triples per Domain 123.93
Typed Entities 3,023

Extractor html-mf-species


Triples Extracted 469,451
URLs with Triples 75,071
Average Triples per URL 6.25
Domains with Triples 291
Average Triples per Domain 1,613.23
Typed Entities 193,744