Gold Standard Design, Statistics, and Download
Dominique Ritze
Oliver Lehmberg
Christian Bizer



This page describes the T2D Gold Standard for evaluating matching systems on the task of matching Web tables to the DBpedia knowledge base.

Many HTML tables on the Web are used for layout purposes, but a small fraction of all tables contains structured data [Cafarella2008][Crestan2011]. As this data has a wide coverage, it could potentially be very valuable for filling missing values and extending cross-domain knowledge bases such as DBpedia, YAGO or the Google Knowledge Graph. As a prerequisite for being able to use table data for knowledge base extension, the Web tables need to be matched to the knowledge base in question, meaning that correspondences between the rows of the tables and the entities described in the knowledge base as well as between the columns of the tables and the schema of the knowledge base need to be found.

Several systems have been developed to solve this matching task [Venetis2010][Limaye 2010][Ellis2014][Zhang2014]. Until now, it has been difficult to compare the performance of these systems, as they were evaluated partly on non-public Web table data and against different knowledge bases. The T2D Gold Standard tries to fill this gap by providing a large set of human-generated correspondences between a public Web table corpus and the DBpedia knowledge base.

The T2D Gold Standard contains schema-level correspondences between 1748 Web tables from the English-language subset of the Web Data Commons Web Tables Corpus and DBpedia Version 2014. For 233 of these tables, all rows have been manually mapped to entities in the DBpedia knowledge base, which resulted in 26,124 entity-level correspondences. All correspondences were generated manually, with an overall effort of about 6 person weeks.

The T2D Gold Standard is provided under the terms of the Apache license for public download below.

1. T2D Gold Standard Overview

The T2D gold standard was designed to fulfill the following requirements:

  1. the gold standard should contain a balanced sample with respect to the
    1. number of rows per table
    2. topics covered
    3. number of property mappings per table
  2. it should cover a realistic subset with characteristics similar to the whole web tables corpus
  3. it should contain high-precision correspondences (ambiguous or fuzzy mappings were excluded)

In order to address the requirements of having a gold standard that is as large as possible and having correspondences on instance-level (between rows in the web table and resources in DBpedia), the gold standard consists of two parts:

  1. a schema-level gold standard covering 1748 tables
  2. an entity-level gold standard that additionally provides row-to-entity correspondences for 233 of these tables

The selection strategy for including Web tables into the gold standard was to first take a random sample from the complete table corpus. The majority of the tables in this sample could not be mapped to DBpedia or were even non-relational (i.e. layout tables) due to errors made by the table classifier. However, these tables can be helpful to distinguish between tables that describe entities that exist in DBpedia and tables for which an algorithm should detect that no rows can be mapped to DBpedia entities. To increase the number of actually mappable tables in the gold standard, we specifically searched the corpus for mappable tables (i.e. tables that have at least some values overlapping with DBpedia). For this task we used the Mannheim Search Joins Engine with input tables from DBpedia covering different topics.

2. Table Characteristics

The table below provides basic statistics about the size of the tables covered by the schema-level gold standard as well as the number of columns of each table that can be mapped to DBpedia properties.

number of rows   no property mapping   1 property mapping   >1 property mapping   sum
<20              805                   45                   346                   1196
<100             193                   17                   69                    279
>100             59                    70                   144                   273
sum              1057                  132                  559                   1748

With the largest group of tables having fewer than 20 rows and no mapping to a DBpedia property at all, we address the representativeness requirement. Thus, we try to cover a realistic scenario with a lot of non-matchable tables with only a few rows. Dividing the gold standard along these two dimensions can also help to see for which tables a system performs well and where it can be improved.
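A per-cell breakdown of this kind can be sketched as follows. This is a minimal illustration, not part of the gold standard: the function names are hypothetical, the per-table score can be any quality measure, and the treatment of tables with exactly 100 rows is an assumption, since the table above only names the classes "<100" and ">100".

```python
from collections import defaultdict
from statistics import mean


def row_bucket(n_rows: int) -> str:
    """Map a row count to the size classes used in the table above."""
    if n_rows < 20:
        return "<20"
    if n_rows < 100:
        return "<100"
    return ">100"  # assumption: exactly 100 rows also lands here


def mapping_bucket(n_mapped_columns: int) -> str:
    """Map the number of columns with a property correspondence to a class."""
    if n_mapped_columns == 0:
        return "no property mapping"
    if n_mapped_columns == 1:
        return "1 property mapping"
    return ">1 property mapping"


def scores_by_bucket(tables):
    """Average a per-table score within each (rows, mappings) cell.

    `tables` is an iterable of (n_rows, n_mapped_columns, score) tuples.
    """
    cells = defaultdict(list)
    for n_rows, n_mapped, score in tables:
        cells[(row_bucket(n_rows), mapping_bucket(n_mapped))].append(score)
    return {cell: mean(values) for cell, values in cells.items()}


# Hypothetical per-table results, for illustration only.
example = [(12, 0, 1.0), (15, 0, 0.5), (250, 3, 0.8)]
result = scores_by_bucket(example)
```

Reporting one averaged score per cell makes it visible whether a system struggles mainly with small, non-matchable tables or with large tables that have many property mappings.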

Concerning the matching task, we call a table a content table if it contains relational data and has at least one column (i.e. the "key column") that is mappable to DBpedia. In the complete gold standard, we have about 763 of these content tables. They are usually larger than the other tables (Ø 206.87 rows, Ø 4.7 columns). Since it is not useful to assign DBpedia classes or properties to non-content tables, only content tables are annotated with this information.

Altogether, the gold standard contains table-to-class correspondences for 91 different DBpedia classes, which ensures a broad coverage of different topics. We grouped these classes based on their super classes (called categories) and show the distribution of tables being mapped to each category in Figure 1. All classes whose only super class is "Thing" are assigned to the category "Thing"; one example is the class "Drug". Other classes are consolidated into a category covering different but related classes, e.g. the category "Organization" contains tables about companies, universities, political parties, airlines, schools etc.

Fig. 1 - Distribution of mapped tables by topical category.

3. Column Characteristics

The tables in the schema-level gold standard altogether contain 7983 columns, of which 4100 columns originate from content tables. Since it is again useless to map a column in a layout table to a DBpedia property, we only create column-to-property correspondences for columns of content tables. This results in 2084 correspondences to DBpedia properties. 750 of these are correspondences from the key column to rdfs:label; the others are from non-key columns to properties of the DBpedia ontology namespace. Together, the correspondences cover 298 different properties.

Figure 2 shows the top 20 properties (without rdfs:label).

Fig. 2 - Top 20 DBpedia properties

Having only columns of a certain data type could also bias the gold standard and encourage systems that focus only on matching this data type. Thus, we chose almost the same column data type distribution as is present in the whole corpus. While 65% of the columns in the gold standard are of type string, the share in the whole corpus is 68%; the same holds for the other data types.

Figure 3 shows the distribution of column data types.

Fig. 3 - Column Data Type Distribution

Including columns with different data types shows the strengths and weaknesses of the similarity functions employed by the matching systems/algorithms.

4. Entity-Level Gold Standard

A second version of the T2D Entity-Level Gold Standard (T2Dv2) is available.

The entity-level gold standard contains 233 tables that have been annotated with a table-to-class correspondence, column-to-property correspondences, and row-to-entity correspondences. Again, we used the Mannheim Search Joins Engine to obtain tables that are likely to be content tables. This subset of the schema-level gold standard covers 33 DBpedia classes and contains 26,124 row-to-entity correspondences. In order to give an overview of the topics of the tables, we used the same categories as before.

Figure 4 shows the distribution of tables per category.

Fig. 4 - Distribution of Tables per Category

The distribution of tables across the categories does not seem to be very balanced. This results from the fact that tables of different categories differ significantly in their number of row-to-entity correspondences. Looking at the number of correspondences per category (Figure 5), the difference becomes visible.

Fig. 5 - Distribution of the Row-to-Entity Correspondences per Category.

The table below provides the average number of correspondences and mapped columns per category.

Category                  avg. # of entity corresp.   avg. # of mapped columns
Work                      131                         3.2
Organization              65                          2.2
Architectural Structure   67                          2.3
Person                    89                          1.5
Species                   250                         2.4
Natural Place             83                          2.1
Populated Place           130                         4.1

While tables of the category "Species" can be mapped to a lot of resources, tables about "Architectural Structure" tend to contain fewer correspondences. The same holds for the number of mappings to properties, e.g. tables of the category "Populated Place" can on average be mapped to 4 properties while tables of the category "Person" only to 1.5 properties. Besides other characteristics, these differences significantly influence the difficulty of matching a certain table. Having a lot of column-to-property correspondences can help to find better row-to-entity correspondences since more property values are exploitable. The same holds vice versa: the more correspondences to entities exist, the easier the task of finding schema-level correspondences becomes.
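To compare a system's output against the row-to-entity correspondences described in this section, one common choice is precision, recall, and F1 over sets of correspondences. This page does not prescribe an evaluation measure, so the sketch below is an assumption: the function name is hypothetical, and each correspondence is represented as a (table name, row index, DBpedia resource URI) triple.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted correspondences with the gold standard.

    Both arguments are iterables of hashable correspondences, e.g.
    (table name, row index, resource URI) triples.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical correspondences, for illustration only.
gold = {
    ("t1", 1, "http://dbpedia.org/resource/A"),
    ("t1", 2, "http://dbpedia.org/resource/B"),
    ("t1", 3, "http://dbpedia.org/resource/C"),
    ("t1", 4, "http://dbpedia.org/resource/D"),
}
predicted = {
    ("t1", 1, "http://dbpedia.org/resource/A"),
    ("t1", 2, "http://dbpedia.org/resource/B"),
    ("t1", 3, "http://dbpedia.org/resource/X"),  # wrong entity
    ("t1", 5, "http://dbpedia.org/resource/Y"),  # row not in the gold standard
}
p, r, f1 = precision_recall_f1(predicted, gold)
```

The same function works for column-to-property or table-to-class correspondences if they are encoded as tuples in the same way.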

As a result of the WDC Web Table Extraction 2015, we can extract context information for the tables used in the gold standard. Besides the page title and the text before and after the table, 47 timestamps before a table and 97 timestamps after a table have been identified. Further, for 7 tables we can find a caption.

5. Data Format

The Web tables and the correspondences are provided as CSV files (except for tables with context information, see the download instructions). Fields are separated by the comma (',') character and all values are double quoted ('"'). There are three different file types for the gold standard: the class correspondence files, the attribute correspondence files, and the entity correspondence files. In all files, tables are uniquely identified by their name, which is the name of the file that contains the table without extension. The class file contains the class correspondences and the header information for each table. It has the following structure:

  table name | DBpedia class name | DBpedia class URI | header row indices (comma-separated list)

The attribute files contain the attribute correspondences and the key information for the tables. For each table, one attribute file with the same name exists. These files have the following structure:

  DBpedia property URI | column header (value from the first row) | is key column (boolean) | column index

The entity files contain the entity correspondences for the tables. For each table, one entity file with the same name exists. These files have the following structure:

  DBpedia resource URI | key value | row index
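The three file layouts can be read with a standard CSV parser. The following is a minimal sketch under stated assumptions: the function names and dictionary keys are hypothetical, the sample rows are invented for illustration, and the literal used for the boolean field (compared case-insensitively with "true") is a guess that should be checked against the downloaded files.

```python
import csv
import io


def read_class_file(lines):
    """Yield one record per table from a class correspondence file."""
    for row in csv.reader(lines):
        yield {
            "table": row[0],
            "class_name": row[1],
            "class_uri": row[2],
            # The header row indices are a comma-separated list in one field.
            "header_rows": [int(i) for i in row[3].split(",") if i.strip()],
        }


def read_attribute_file(lines):
    """Yield one record per mapped column from an attribute file."""
    for row in csv.reader(lines):
        yield {
            "property_uri": row[0],
            "column_header": row[1],
            "is_key": row[2].strip().lower() == "true",  # assumed literal
            "column_index": int(row[3]),
        }


def read_entity_file(lines):
    """Yield one record per mapped row from an entity file."""
    for row in csv.reader(lines):
        yield {
            "resource_uri": row[0],
            "key_value": row[1],
            "row_index": int(row[2]),
        }


# Invented sample rows, for illustration only.
sample_classes = io.StringIO(
    '"example_table","Country","http://dbpedia.org/ontology/Country","0"\n'
)
sample_attributes = io.StringIO(
    '"http://dbpedia.org/ontology/capital","capital","False","2"\n'
)
sample_entities = io.StringIO(
    '"http://dbpedia.org/resource/Germany","germany","1"\n'
)
classes = list(read_class_file(sample_classes))
attributes = list(read_attribute_file(sample_attributes))
entities = list(read_entity_file(sample_entities))
```

When reading the downloaded files from disk, pass a file object opened with `newline=""` to these functions, as recommended for Python's `csv` module; the file encoding is not stated on this page and has to be checked.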

6. Download

To download the corpus of Web tables as well as the correspondences use the following links:

Gold standard                      Tables                   Class correspondences   Attribute correspondences    Entity correspondences
Example tables / correspondences   sample_table.csv         sample_classes.csv      sample_attributes.csv        sample_entities.csv
Complete gold standard             tables_complete.tar.gz   classes_complete.csv    attributes_complete.tar.gz   -
Instance-level gold standard       tables_instance.tar.gz   classes_instance.csv    attributes_instance.tar.gz   entities_instance.tar.gz

Instance-level gold standard with context: tables_instance_context.tar.gz
Extended instance-level gold standard containing negative examples: extended_instance_goldstandard.tar.gz
Extended gold standard with manual fixes: extendedv2.tar.gz
DBpedia subset: dbpedia_subset.tar.gz

7. License

The correspondences of the T2D Gold Standard are provided under the terms of the Apache license. The Web tables are provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus. The DBpedia subset is licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License that apply to DBpedia.

8. Acknowledgements

We would like to thank Oktie Hassanzadeh, Mariano Rodriguez, Kavitha Srinivas and Michael J. Ward for their feedback on our gold standard.

9. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

10. References

  1. [Cafarella2008] Michael J. Cafarella, Eugene Wu, Alon Halevy, Yang Zhang, Daisy Zhe Wang: WebTables: exploring the power of tables on the web. VLDB 2008.
  2. [Crestan2011] Eric Crestan and Patrick Pantel: Web-scale table census and classification. WSDM 2011.
  3. [Cafarella2009] Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova: Data integration for the relational web. Proc. VLDB Endow. 2009.
  4. [Venetis2010] Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu: Table Search Using Recovered Semantics. 2010.
  5. [Limaye 2010] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow. 3, 1-2, 2010.
  6. [Zhang2013] Xiaolu Zhang et al.: Mapping entity-attribute web tables to web-scale knowledge bases. In: Database Systems for Advanced Applications. Springer, 2013.
  7. [Wang2012] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu: Understanding tables on the web. In Proceedings of the 31st international conference on Conceptual Modeling (ER'12), 2012.
  8. [Ellis2014] Jason Ellis, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, Michael J. Ward: Exploring Big Data with Helix: Finding Needles in a Big Haystack. In ACM SIGMOD Record, Volume 43 Issue 4, 2014.
  9. [Zhang2014] Ziqi Zhang: Towards efficient and effective semantic table interpretation. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014), 2014.