Dataset Design, Statistics, and Download
Yaser Oulabi
Christian Bizer


This page describes the T4LTE dataset, a gold standard for the task of long-tail entity extraction from web tables.

Knowledge bases like DBpedia, Wikidata or YAGO all rely on data that has been extracted from Wikipedia and as a result cover mostly head instances that fulfill the Wikipedia notability criteria [Oulabi2019]. Their coverage of less well-known instances from the long tail is rather low [Dong2014]. As the usefulness of a knowledge base increases with its completeness, adding long-tail instances to a knowledge base is an important task. Web tables [Cafarella2008], which are relational HTML tables extracted from the Web, contain large amounts of structured information, cover a wide range of topics, and describe very specific long-tail instances. Web tables are thus a promising source of information for the task of augmenting cross-domain knowledge bases.

This dataset provides annotations for a selection of web tables for the task of augmenting the DBpedia knowledge base with new long-tail entities from those web tables. It includes annotations for the unique number of entities that can be derived from these web tables, and which of these entities are new, given the instances that are already covered in DBpedia. Additionally, there are annotations for values and facts that can be generated from web table data, allowing the evaluation of how well descriptions of new entities were created.

This dataset was used to develop and evaluate a method for augmenting a knowledge base with long-tail entities from web tables [Oulabi2019]. Using this dataset for training, we were able to add 187 thousand new Song entities with 394 thousand facts and 14 thousand new GridironFootballPlayer entities with 44 thousand new facts to DBpedia. In terms of the number of instances, this is an increase of 356 % and 67 % for Song and GridironFootballPlayer respectively [Oulabi2019].


1. Dataset Purpose

The purpose of this dataset is to act as a gold standard for evaluating the extraction of long-tail entities from web tables. It fulfills three tasks: evaluating how well rows describing the same real-world entity are clustered across web tables, evaluating whether these entities are correctly identified as new or as already existing in the knowledge base, and evaluating the correctness of the facts generated for these entities from web table data.

2. Knowledge Base and Class Selection

We employ DBpedia [Lehmann2015] as the target knowledge base to be extended. It is extracted from Wikipedia and especially Wikipedia infoboxes. As a result, the covered instances are limited to those identified as notable by the Wikipedia community [Oulabi2019]. We use the 2014 release of DBpedia, as this release has been used in related work [Ritze2015, Ritze2016, Oulabi2016, Oulabi2017], and its release date is also closer to the extraction of the web table corpus from which we created this dataset.

From DBpedia we selected three classes for which we built the dataset. This selection was based on four criteria:

Based on this approach we chose the following three classes: (1) GridironFootballPlayer (GF-Player), (2) Song and (3) Settlement, where the class Song also includes all instances of the class Single.

Given those three classes, we profile the existing entities within the knowledge base. The first table provides the number of instances and facts per class, while the second profiles the properties and their densities. The first table shows that DBpedia already covers tens of thousands of instances for the profiled classes. This could indicate that most of the well-known instances are already covered, which is why we are especially interested in finding instances from the long tail.

Class Instances Facts
GF-Player 20,751 137,319
Song 52,533 315,414
Settlement 468,986 1,444,316

The following table reveals that the density differs significantly from property to property. We only consider head properties that have a density of at least 30 %.

Only the properties of the class Song have consistently high densities of more than 60 %. The football player class has many properties, but half of them have a density below 50 %. The class Settlement suffers from both a small number of properties and low densities for some of them.

Class Property Facts Density
GF-Player birthDate 20,218 97.43 %
GF-Player college 19,281 92.92 %
GF-Player birthPlace 17,912 86.32 %
GF-Player team 13,349 64.33 %
GF-Player number 11,430 55.08 %
GF-Player position 11,240 54.17 %
GF-Player height 10,059 48.47 %
GF-Player weight 10,027 48.32 %
GF-Player draftYear 7,947 38.30 %
GF-Player draftRound 7,932 38.22 %
GF-Player draftPick 7,924 38.19 %

Song genre 47,040 89.54 %
Song musicalArtist 45,097 85.85 %
Song recordLabel 43,053 81.95 %
Song runtime 42,035 80.02 %
Song album 40,666 77.41 %
Song writer 33,942 64.61 %
Song releaseDate 31,696 60.34 %

Settlement country 433,838 92.51 %
Settlement isPartOf 416,454 88.80 %
Settlement populationTotal 292,831 62.44 %
Settlement postalCode 154,575 32.96 %
Settlement elevation 146,618 31.26 %

3. Web Table Corpus

We extracted this dataset from the English-language relational tables set of the Web Data Commons 2012 Web Table Corpus. The set consists of 91.8 million tables. The table below gives an overview of the general characteristics of the tables in the corpus. We can see that the majority of tables are rather short, with an average of 10.4 rows and a median of 2, whereas the average and median number of columns are 3.5 and 3. As a result, a table on average describes 10 instances with 30 values, which likely is a sufficient size and potentially useful for finding new instances and their descriptions. In [Ritze2016] we profiled the potential of the same corpus for the task of slot filling, i.e. finding missing values for existing DBpedia instances.

  Average Median Min Max
Rows 10.37 2 1 35,640
Columns 3.48 3 2 713

For every table we assume that there is one attribute that contains the labels of the instances described by the rows. The remaining columns contain values, which potentially can be used to generate descriptions according to the knowledge base schema.

For the three evaluated classes, the following table shows the result of matching the web table corpus to existing instances and properties in DBpedia using the T2K Matching Framework [Ritze2015, Ritze2016]. The first column shows the number of matched tables that have at least one matched attribute column. Rows of those tables were matched directly to existing instances of DBpedia. The second and third columns show how many values were matched to existing instances and how many values remained unmatched. While more values were matched than remained unmatched, the number of unmatched values is still large, especially for the Song class.

Class Tables VMatched VUnmatched
GF-Player 10,432 206,847 35,968
Song 58,594 1,315,381 443,194
Settlement 11,757 82,816 13,735

4. Dataset Creation

In this section we outline the creation of the dataset. This includes how we selected tables from the corpus, how the labeling process worked, and which annotations are included in the dataset.

4.1 Web Tables Selection

For the gold standard we had to select a certain number of tables per class to annotate. We first matched tables to classes in the knowledge base using the T2K framework [Ritze2015]. We then selected the tables for each class separately. To do this, we first divided the instances in the knowledge base into quartiles of popularity, using the indegree count based on a dataset of Wikipedia page-links [1, 2]. We then selected three instances per quartile, i.e. 12 per class. Additionally, for each of these 12 knowledge base instances, we looked in the table corpus for the label without a match in the knowledge base that co-occurs most often with the label of the selected instance, yielding 12 "new" labels. For both the labels of the 12 knowledge base instances and the additional 12 "new" labels, we extracted up to 15 tables per label, ensuring that few tables are chosen from the same pay-level domain (PLD) and that the tables vary in the types of attributes they contain.

4.2 Labeling Process

Using the method for table selection described above, we ended up with 280, 217, and 620 tables for the classes GridironFootballPlayer, Song and Settlement respectively. We did not label all tables, and especially not all rows of these tables, but looked for potentially new entities and entities with potentially conflicting names (homonyms). For those we then created clusters by identifying the rows within the tables that describe the same real-world entity. From these row clusters, entities can be created and added to the knowledge base. For each cluster we then identified whether the entity already exists in DBpedia or is a new entity that can be added to DBpedia. For existing entities, we also added a correspondence to the URI of the entity in DBpedia.

For all web tables from which we created row clusters, we matched the table columns to the properties of the knowledge base. These property correspondences allow us to identify how many candidate values exist for a certain combination of entity and property. Finally, we also annotated facts for all clusters, i.e. the correct values for certain properties. We only annotated a fact for a property if a candidate value exists in the table data, i.e. only if, for a row cluster, there is a row within a table that describes that property in one of its columns did we annotate the correct fact. We also annotated whether the value of the correct fact is present among the values in the web tables.

When labeling rows, we aimed at labeling interesting row clusters first. As a result, most tables only have a small number of rows labeled. This does not apply to columns: whenever we label one row in a table, we always label all of its columns.

Finally, for the class Song, we include additional row clusters for existing entities for learning purposes only. These clusters are not fully labeled, as they are missing the fact annotations.

4.3 Annotation Types

The dataset contains various annotation types, which are all described in the table below.

Annotation Type Description Format
Table-To-Class Annotation All tables included in the dataset are matched to one of the three classes we chose to evaluate. The tables are placed in separate folders per class.
Row-To-Instance Annotation For a selection of table rows, we annotate which instance they belong to, existing or new. If the instance described by the row already exists in DBpedia, the row is mapped to the entity URI of that instance in DBpedia. Otherwise we generate a new random URI, but keep the prefix of DBpedia entity URIs. All rows matched to the same instance form a row cluster. CSV file format (see 7.3)
New Instance Annotation We provide the list of entity URIs that we created to describe new instances that do not yet exist in DBpedia. LST file format (see 7.6)
Attribute-To-Property Annotation Given the columns of a web table, we annotate which of these columns describe information that corresponds to a property in the schema of the knowledge base. CSV file format (see 7.2)
Fact Annotations Given row clusters and attribute-to-property correspondences, we can determine for each entity, existing or new, described in the dataset, for which triples we have candidate values in the web tables. For these triples, we annotate the correct facts to allow for evaluation. We additionally annotate whether the correct fact is present among the candidate values within the web table data. CSV file format (see 7.5)

5. Dataset Statistics

The following table provides an overview of the number of annotations in the dataset. The first three columns show the number of table, attribute and row annotations. On average, we have 1.85 attribute annotations per table, not counting the label attribute. The two following columns show the number of annotated clusters, followed by the number of values within those clusters that match a knowledge base property. Overall we annotated 266 clusters, of which 103 are new, where each cluster has on average 3.63 rows and 7.85 matched values. The last two columns show the number of unique facts that can be derived for those clusters and the number of facts for which a correct value is present. Per cluster we can derive on average 3.17 facts, and for 92 % of them the correct value is present.

Class Tables Attributes Rows Existing Clusters New Clusters Matched Values Facts Correct Value Present
GF-Player 192 572 358 80 17 1,177 460 436
Song 152 248 195 34 63 428 231 212
Settlement 188 162 413 51 23 487 152 124
Sum 532 982 966 165 103 2,092 843 772

The number of row clusters describing existing entities is low for Song, especially when compared to the number of clusters describing new entities. This is not relevant for evaluation purposes, but for learning, more training examples for existing entities might be required. We therefore additionally include 15 existing entities for learning purposes only. Unlike the other existing entities, we did not annotate any facts for those entities. For these entities we also include 17 additional tables for the class Song, so that overall 169 tables for the class Song are included in the dataset.


The following three tables show the distribution of properties among the matched values and annotated facts in the dataset per class. We notice that for all three classes there are clear head properties, for which many more values were matched than for the remaining properties. We also find that for some properties we have barely any matched values. The number of such properties is especially high for the class Settlement.

GridironFootballPlayer Matched Values Facts Correct Value Present
http://dbpedia.org/ontology/birthDate 50 37 33
http://dbpedia.org/ontology/birthPlace 5 5 3
http://dbpedia.org/ontology/college 246 82 81
http://dbpedia.org/ontology/draftPick 48 20 20
http://dbpedia.org/ontology/draftRound 10 10 10
http://dbpedia.org/ontology/draftYear 20 10 10
http://dbpedia.org/ontology/height 134 61 58
http://dbpedia.org/ontology/highschool 4 4 4
http://dbpedia.org/ontology/number 72 40 37
http://dbpedia.org/ontology/Person/weight 141 63 56
http://dbpedia.org/ontology/position 269 78 74
http://dbpedia.org/ontology/team 178 50 50


Song Matched Values Facts Correct Value Present
http://dbpedia.org/ontology/album 98 54 50
http://dbpedia.org/ontology/bSide 1 1 1
http://dbpedia.org/ontology/genre 9 7 6
http://dbpedia.org/ontology/musicalArtist 167 64 64
http://dbpedia.org/ontology/producer 1 1 1
http://dbpedia.org/ontology/recordLabel 16 10 8
http://dbpedia.org/ontology/releaseDate 53 38 33
http://dbpedia.org/ontology/runtime 78 52 45
http://dbpedia.org/ontology/writer 5 4 4


Settlement Matched Values Facts Correct Value Present
http://dbpedia.org/ontology/area 3 3 0
http://dbpedia.org/ontology/continent 2 2 2
http://dbpedia.org/ontology/country 156 22 22
http://dbpedia.org/ontology/elevation 4 2 0
http://dbpedia.org/ontology/isPartOf 158 50 50
http://dbpedia.org/ontology/populationDensity 3 3 0
http://dbpedia.org/ontology/populationMetro 8 7 0
http://dbpedia.org/ontology/populationTotal 30 21 10
http://dbpedia.org/ontology/postalCode 100 22 22
http://dbpedia.org/ontology/utcOffset 5 4 4
http://www.w3.org/2003/01/geo/wgs84_pos#long 9 8 6
http://www.w3.org/2003/01/geo/wgs84_pos#lat 9 8 8

6. Cross-Validation Splits

We use the gold standard for learning and testing. For this, we split the data into three folds and performed cross-validation in our research [Oulabi2019]. To allow for comparable results, we provide the exact folds used in our work.

We split by cluster, so that the rows of one cluster are always fully included in one fold. We also ensured that both new clusters and homonym groups are split evenly. A homonym group is a group of clusters with highly similar labels. All clusters of a homonym group were always placed in the same fold.

Class Fold Clusters New Homonym Groups Clusters in Homonym Groups Rows
GridironFootballPlayer All 97 17 10 21 358
GridironFootballPlayer 0 31 5 3 6 126
GridironFootballPlayer 1 33 5 4 8 118
GridironFootballPlayer 2 33 7 3 7 114

Song All 97 63 20 65 195
Song 0 32 18 6 21 51
Song 1 34 24 7 27 86
Song 2 31 21 7 17 58

Settlement All 74 23 14 31 413
Settlement 0 26 7 5 11 150
Settlement 1 24 9 4 9 106
Settlement 2 24 7 5 11 157
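
To illustrate how the fold lists can be used, the following minimal Python sketch (our own illustration, not part of the dataset) loads the three folds of one class and checks that the cluster URIs are disjoint across folds; the path assumes the directory structure described in section 7.1.

from pathlib import Path


def read_lst(path):
    """Read a .lst file: one entry per line, UTF-8, no quoting (see section 7.6)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def load_folds(class_dir):
    """Return the three lists of cluster URIs used for cross-validation."""
    return [read_lst(Path(class_dir) / f"fold{i}.lst") for i in range(3)]


folds = load_folds("Settlement")  # assumes the unpacked dataset as working directory
for i, fold in enumerate(folds):
    print(f"fold {i}: {len(fold)} clusters")

# Each cluster (and each homonym group) is fully contained in exactly one fold.
assert set(folds[0]).isdisjoint(folds[1])
assert set(folds[0]).isdisjoint(folds[2])
assert set(folds[1]).isdisjoint(folds[2])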

7. Structure and File Formats

The dataset broadly contains three different file formats: CSV files, LST list files and JSON table files.

All files are encoded using UTF-8. All CSV files have no headers, use commas as separators and double quotation marks as quotation characters. In LST files, each line corresponds to an entry in the list. No quotation or separation characters are used in LST files.

7.1 Dataset Directory Structure

The gold standard is split into three separate folders, one per knowledge base class. These folders have the following structure:

CLASS_NAME (GridironFootballPlayer, Song, Settlement)

├─ attributeMapping
│ ├─ table1.csv
│ ├─ table2.csv
│ ├─ table3.csv
│ ├─ ...
│ └─ ...

├─ rowMapping
│ ├─ table1.csv
│ ├─ table2.csv
│ ├─ table3.csv
│ ├─ ...
│ └─ ...

├─ tables
│ ├─ table1.json
│ ├─ table2.json
│ ├─ table3.json
│ ├─ ...
│ └─ ...

├─ facts.csv
├─ fold0.lst
├─ fold1.lst
├─ fold2.lst
├─ forLearning.lst (Song only)
├─ newInstances.lst
├─ referencedEntities.csv (Song only)
└─ tableList.lst
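
For illustration, the following Python sketch (our own; the function name is ours) walks this structure and collects, per class, the table JSON files together with their mapping files. It assumes the archive has been unpacked into the working directory; not every table necessarily has both mapping files, so existence should be checked before opening them.

from pathlib import Path

CLASSES = ["GridironFootballPlayer", "Song", "Settlement"]


def list_tables(class_dir):
    """Yield (table name, table JSON path, attribute mapping path, row mapping path)."""
    class_dir = Path(class_dir)
    names = (class_dir / "tableList.lst").read_text(encoding="utf-8").splitlines()
    for name in (n.strip() for n in names):
        if not name:
            continue
        yield (
            name,
            class_dir / "tables" / f"{name}.json",
            class_dir / "attributeMapping" / f"{name}.csv",
            class_dir / "rowMapping" / f"{name}.csv",
        )


for cls in CLASSES:
    tables = list(list_tables(cls))
    print(cls, len(tables), "tables listed")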

7.2 Attribute Mapping CSV Format

The attribute mapping consists of files that describe correspondences between columns of tables included in the dataset and properties of the knowledge base. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.

Each row of this file describes two values. The first value contains the web table column number, while the second contains the mapped DBpedia property. The first column of a table has the number 0.

Example: Song/attributeMapping/1346981172250_1346981783559_593.arc5217493555668914181#52592088_6_1764789873557608114.csv

"0","http://dbpedia.org/ontology/musicalArtist"
"3","http://dbpedia.org/ontology/releaseDate"
"4","http://dbpedia.org/ontology/recordLabel"
"6","http://dbpedia.org/ontology/genre"

7.3 Row Mappings CSV Format

The row mapping consists of files that describe which table rows correspond to which entity URI. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.

Each row of this file describes two values. The first value contains the web table row number, while the second contains the full URI of the entity. The first row of the table, which is very often the header row, has the number 0.

Example: GridironFootballPlayer/rowMapping/1346876860779_1346961276986_5601.arc3719795019286941883#45451800_0_7520172109909715831.csv

"28","http://dbpedia.org/resource/Andrew_Sweat"
"32","http://dbpedia.org/resource/Ameet_Pall"
"46","http://dbpedia.org/resource/Jerrell_Wedge-00869aeb-d468-46fc-8a33-e11e6b771730"
"50","http://dbpedia.org/resource/Chris Donald-248fa1b2-6061-4e39-b394-4b0717de75b4"
"35","http://dbpedia.org/resource/shelly_lyons-c026bb63-4fa2-11e8-9b01-1d14cf16e545"
"24","http://dbpedia.org/resource/Brandon_Marshall_(linebacker)"
"23","http://dbpedia.org/resource/Jerrell_Harris"

7.4 Table JSON Format

The JSON files within the tables folder fully describe the individual tables included in the dataset, including rows that were not annotated as part of the dataset. The JSON format is described below using the example. These tables can also be found in the web table corpus linked above.

Two properties are important. First, the relation property contains the actual content of the table. It is an array of arrays, where the outer array contains the columns of the table, and each inner array contains the cells of that column for all rows. The second important property is keyColumnIndex, which specifies which column is the key column of the table and is therefore linked to the label property of the knowledge base.

Example: GridironFootballPlayer/tables/1346823846150_1346837956103_5045.arc6474871151262925852#91994528_4_1071800122125102457.json

{
   "hasKeyColumn":true,
   "headerRowIndex":0,
   "keyColumnIndex":0,
   "pld":"draftboardinsider.com",
   "url":"http://www.draftboardinsider.com/ncaateams/sec/auburn.shtml",
   "relation":[
      [
         "player",
         "ben grubbs",
         "kenny irons",
         "will herring",
         "david irons",
         "courtney taylor"
      ],
      [
         "pos",
         "og",
         "rb",
         "olb",
         "cb",
         "wr"
      ],
      [
         "pick",
         "29",
         "49",
         "161",
         "194",
         "197"
      ],
      [
         "nfl team",
         "baltimore ravens",
         "cincinnati bengals",
         "seattle seahawks",
         "atlanta falcons",
         "seattle seahawks"
      ]
   ]
}
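
As an illustration of this column-major layout, the Python sketch below (our own) loads the example table, transposes the relation array into rows, and prints the label cell of every data row. The row indices produced this way are the ones used in the row mapping files (section 7.3).

import json


def read_table(path):
    """Load one table JSON file and return it together with its rows in row-major form."""
    with open(path, encoding="utf-8") as f:
        table = json.load(f)
    # "relation" is column-major: the outer array holds columns, each inner array
    # holds the cells of one column for all rows, so a transpose yields the rows.
    rows = list(zip(*table["relation"]))
    return table, rows


path = ("GridironFootballPlayer/tables/"
        "1346823846150_1346837956103_5045.arc6474871151262925852"
        "#91994528_4_1071800122125102457.json")
table, rows = read_table(path)
key = table["keyColumnIndex"]      # column linked to the label of the described instance
header = table["headerRowIndex"]   # the header row also counts towards the row numbering
for row_index, row in enumerate(rows):
    if row_index == header:
        continue
    print(row_index, row[key])     # e.g. "1 ben grubbs", "2 kenny irons", ...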

7.5 Facts CSV Format

Each line in the facts file describes one individual annotated fact. Per line, there are four values. The first contains the URI of the entity, while the second contains the URI of the property. The third contains the annotated fact, while the last is a boolean flag indicating whether the correct value of the fact is present among the values found in the web table data, where "true" and "false" correspond to present and not present respectively.

While for most facts there is only one correct value, for some there can be multiple correct values. Multiple values are separated by a pipe character (|) and need to be split accordingly when using the dataset.

Parsing the first two and the last value is simple; for the actual fact annotation, parsing depends on the data type of the fact. The table below provides parsing instructions:

Data-Type Description Format Example
Date A date with either year or day granularity yyyy OR yyyy-mm-dd 2000 OR 2012-04-20
Reference DBpedia URI (needs to be prefixed with http://dbpedia.org/resource/) No parsing required Nina_Simone
String Literal string No parsing required FH3312
Integer Integer number No parsing required 21
Decimal Decimal number; some values have no fractional part and are simple integers I.F OR I 187.96 OR 88.45051 OR 4144
Signed Decimal Decimal number with a sign ±I.F +1.0 OR -4.5333333
Runtime Runtime in minutes and seconds m:ss 4:01 OR 5:13
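
The sketch below (our own illustration, covering only some of the data types) shows how the parsing rules above could be applied, including splitting multiple correct values on the pipe character and converting runtime values into seconds as required by DBpedia (see the note in the next table).

from datetime import date


def split_values(raw):
    """A fact annotation may contain several correct values separated by '|'."""
    return raw.split("|")


def parse_date(value):
    """'yyyy' (year granularity) or 'yyyy-mm-dd' (day granularity)."""
    parts = [int(p) for p in value.split("-")]
    return parts[0] if len(parts) == 1 else date(*parts)


def parse_runtime(value):
    """'m:ss', e.g. '4:01' -> 241 seconds (DBpedia stores runtime in seconds)."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


def to_dbpedia_uri(local_name):
    """Reference values are local names that need the DBpedia resource prefix."""
    return "http://dbpedia.org/resource/" + local_name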


The following table provides a mapping between the properties and the data types above. Additionally, we provide notes per property where applicable.

Class Property Data-Type Note
GridironFootballPlayer birthDate Date  
GridironFootballPlayer birthPlace Reference  
GridironFootballPlayer college Reference  
GridironFootballPlayer draftPick Integer  
GridironFootballPlayer draftRound Integer  
GridironFootballPlayer draftYear Date  
GridironFootballPlayer height Decimal We record height in centimeters, while DBpedia records height in meters, so that a conversion is necessary. Also, all tables exclusively record height in feet and inches.
GridironFootballPlayer highschool Reference  
GridironFootballPlayer number Integer  
GridironFootballPlayer Person/weight Decimal We record weight in kg, and so does DBpedia. All tables exclusively record weight in pounds.
GridironFootballPlayer position Reference  
GridironFootballPlayer team Reference  
       
Song album Reference  
Song bSide Reference  
Song genre Reference  
Song musicalArtist Reference  
Song producer Reference  
Song recordLabel Reference  
Song releaseDate Date  
Song runtime Runtime DBpedia records runtime in seconds as a simple numeric property, while we record it as time in minutes and seconds. As a result, a conversion is necessary.
Song writer Reference  
       
Settlement area Decimal  
Settlement continent Reference  
Settlement country Reference  
Settlement elevation Decimal  
Settlement isPartOf Reference  
Settlement populationDensity Decimal  
Settlement populationMetro Decimal  
Settlement populationTotal Decimal  
Settlement postalCode String  
Settlement utcOffset Signed Decimal  
Settlement wgs84_pos#long Signed Decimal  
Settlement wgs84_pos#lat Signed Decimal  
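
The conversion notes for height and weight above imply a few unit conversions when comparing web table values against the gold standard. The following sketch (our own, under the stated assumptions) illustrates them.

def feet_inches_to_cm(feet, inches=0):
    """Web tables record height in feet and inches; the gold standard uses centimeters."""
    return (feet * 12 + inches) * 2.54


def pounds_to_kg(pounds):
    """Web tables record weight in pounds; the gold standard and DBpedia use kilograms."""
    return pounds * 0.45359237


def cm_to_m(cm):
    """DBpedia records height in meters, the gold standard in centimeters."""
    return cm / 100.0


# 6'2" corresponds to the height fact 187.96 shown in the examples below.
assert abs(feet_inches_to_cm(6, 2) - 187.96) < 1e-6
# 195 lb corresponds to the weight fact 88.45051 shown in the examples below.
assert abs(pounds_to_kg(195) - 88.45051) < 1e-4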


Below you will find some examples of the facts CSV file for all three classes.

Example: GridironFootballPlayer/facts.csv

"http://dbpedia.org/resource/Al_Harris_(defensive_lineman)","http://dbpedia.org/ontology/position","Defensive_end|Linebacker","true"
"http://dbpedia.org/resource/Allen_Reisner","http://dbpedia.org/ontology/birthDate","1988-09-29","true"
"http://dbpedia.org/resource/Mike_Bell_(defensive_lineman)","http://dbpedia.org/ontology/college","Colorado_State_Rams","true"
"http://dbpedia.org/resource/Andre_Roberts_(American_football)","http://dbpedia.org/ontology/Person/weight","88.45051","true"
"http://dbpedia.org/resource/louis_nzegwu-90617f1a-1dbf-48c0-ae52-dfb4cb5043ab","http://dbpedia.org/ontology/team","Atlanta_Falcons","true"
"http://dbpedia.org/resource/Donald_Jones_(American_football)","http://dbpedia.org/ontology/birthDate","1987-12-17","true"
"http://dbpedia.org/resource/Mike_Williams_(wide_receiver,_born_1987)","http://dbpedia.org/ontology/height","187.96","true"
"http://dbpedia.org/resource/Anquan_Boldin","http://dbpedia.org/ontology/team","Arizona_Cardinals|Baltimore_Ravens|San_Francisco_49ers|Detroit_Lions|Buffalo_Bills","true"
"http://dbpedia.org/resource/Al_Harris_(defensive_lineman)","http://dbpedia.org/ontology/team","Chicago_Bears","true"
"http://dbpedia.org/resource/Mike_Williams_(wide_receiver,_born_1984)","http://dbpedia.org/ontology/draftPick","10","true"
...

Example: Song/facts.csv

"http://dbpedia.org/resource/rhythm_of_life-17f821d8-8424-49b9-ad35-aae0094d475c","http://dbpedia.org/ontology/musicalArtist","U96","true"
"http://dbpedia.org/resource/Men's_Needs","http://dbpedia.org/ontology/musicalArtist","The_Cribs","true"
"http://dbpedia.org/resource/The_Lemon_Song","http://dbpedia.org/ontology/runtime","6:19","true"
"http://dbpedia.org/resource/lemon_tree-c4ef5525-b118-4ed9-8206-2d64b91a0b89","http://dbpedia.org/ontology/album","The_Very_Best_of_Peter,_Paul_and_Mary-f3c362a9-764f-45b2-a3b4-dc32f71c8902","true"
"http://dbpedia.org/resource/Seek_&_Destroy","http://dbpedia.org/ontology/album","Kill_%27Em_All","true"
"http://dbpedia.org/resource/I'm_Ready_for_Love","http://dbpedia.org/ontology/musicalArtist","Martha_and_the_Vandellas","true"
"http://dbpedia.org/resource/Something_About_You-646ffccf-6fb9-4279-9b84-eb582b959388","http://dbpedia.org/ontology/album","Aliens_&_Rainbows","true"
"http://dbpedia.org/resource/Beautiful_(Mai_Kuraki_song)","http://dbpedia.org/ontology/releaseDate","2009-06-10","true"
"http://dbpedia.org/resource/Lemon_(song)","http://dbpedia.org/ontology/releaseDate","1993","true"
...

Example: Settlement/facts.csv

"http://dbpedia.org/resource/Beijing","http://www.w3.org/2003/01/geo/wgs84_pos#long","116.383333","true"
"http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/area","1285","false"
"http://dbpedia.org/resource/Bakar","http://dbpedia.org/ontology/country","Croatia","true"
"http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/utcOffset","+1.0","true"
"http://dbpedia.org/resource/arriondas-a4a0fc90-8a84-11e8-82e1-3fad23f94135","http://dbpedia.org/ontology/country","Spain","true"
"http://dbpedia.org/resource/Bwaga_Cheti","http://www.w3.org/2003/01/geo/wgs84_pos#lat","-4.5333333","true"
"http://dbpedia.org/resource/belec-a4a20efb-8a84-11e8-82e1-55488444b4bb","http://dbpedia.org/ontology/postalCode","49254","true"
"http://dbpedia.org/resource/Bakarac","http://dbpedia.org/ontology/isPartOf","Primorje-Gorski_Kotar_County","true"
...

7.6 Table, New Instance, Folds and for Learning Lists

We use the list file format for several purposes: the list of tables per class (tableList.lst), the list of new instance URIs (newInstances.lst), the cross-validation folds (fold0.lst, fold1.lst, fold2.lst) and, for the class Song, the list of clusters included for learning purposes only (forLearning.lst).

These list files have the extension .lst. Each line of the file is another entry in the list. There are no quoting characters.
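
A small Python helper for these list files (our own sketch, assuming the directory layout of section 7.1), together with an example of using newInstances.lst to decide whether a mapped entity URI denotes a new entity.

def read_lst(path):
    """One entry per line, UTF-8, no quotation or separation characters."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


new_instances = set(read_lst("GridironFootballPlayer/newInstances.lst"))


def is_new_entity(entity_uri):
    """True if the URI was minted for an entity that does not yet exist in DBpedia."""
    return entity_uri in new_instances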

Example: Settlement/fold0.lst

http://dbpedia.org/resource/Aurel,_Vaucluse
http://dbpedia.org/resource/Stamford,_Lincolnshire
http://dbpedia.org/resource/Bwaga_Cheti
http://dbpedia.org/resource/kalakho-f889b410-4fa2-11e8-9b01-4b7e9cb868f5
http://dbpedia.org/resource/Belica,_Me%C4%91imurje_County
http://dbpedia.org/resource/burgo_ranero_(el)-a4a1e6e3-8a84-11e8-82e1-f765c0073ff5
http://dbpedia.org/resource/Parys
http://dbpedia.org/resource/Bakar
http://dbpedia.org/resource/Beli_Manastir
http://dbpedia.org/resource/Chaville
http://dbpedia.org/resource/Bonyunyu
...

Example: Song/tableList.lst

1346981172231_1347009637666_1623.arc2035420803784423551#8157737_1_7371407078293434892
1346981172137_1346990327917_1687.arc8790234217643537183#29324093_0_4104648016207008655
1350433107059_1350464532277_262.arc3274150641837087721#45894422_0_4048022465851316720
1346876860798_1346941853400_2953.arc2527404313287902461#59379225_0_6616355908335718299
1346876860596_1346938127566_2223.arc3753870089959127664#66546593_0_554336699268001312
1346876860493_1346903186037_233.arc7106100585551357027#48571984_0_6773473850340800215
1350433107058_1350501694041_732.arc1602187029723264891#37777045_0_5887313017136165099
1346981172186_1346995420474_2788.arc8792188372387791527#96007354_0_5596511497072590105
1346876860840_1346953273866_1315.arc7476207801019051251#90975928_0_7687754714967118394
1346981172231_1347010609872_3515.arc2403866143077224377#1524220_0_7599368370767966283
1346876860596_1346938903566_3154.arc4273244386981436402#87484324_1_1397156714755041772
1346876860611_1346928074552_1536.arc941312314634454173#7668872_0_4460134851750954295
1346876860807_1346934722734_147.arc4322826721152635511#74438125_0_3796119154304144126
1346981172155_1347002310061_2451.arc2996742124595566891#69073711_0_6127538729831462210
1346981172239_1346995950584_63.arc1630314548530234317#65396150_0_5145880845606151839
...

Example: GridironFootballPlayer/newInstances.lst

http://dbpedia.org/resource/shelly_lyons-c026bb63-4fa2-11e8-9b01-1d14cf16e545
http://dbpedia.org/resource/mike_ball-86931519-f533-4e99-b1be-b45fb805e7e5
http://dbpedia.org/resource/michael_vandermeulen-c020ef8f-4fa2-11e8-9b01-c3c34fe696e5
http://dbpedia.org/resource/Chris Donald-248fa1b2-6061-4e39-b394-4b0717de75b4
http://dbpedia.org/resource/james_carmon-c02869e4-4fa2-11e8-9b01-81f205eb29eb
http://dbpedia.org/resource/alvin_mitchell-33d17a8e-ed7d-433e-9728-3e1c26658a6a
http://dbpedia.org/resource/ben_buchanan-c0289116-4fa2-11e8-9b01-fdd98ba7cc66
http://dbpedia.org/resource/merritt_kersey-c025d110-4fa2-11e8-9b01-65117e7d5512
http://dbpedia.org/resource/aderious_simmoms-c023ae8d-4fa2-11e8-9b01-89f1d0e15a08
...

7.7 Referenced Entities

For the class Song there exist facts that reference entities that do not exist in DBpedia, i.e. the referenced entities are long-tail entities themselves. We provide these additional entities in a separate file. The file is especially useful, as it provides both the labels and the classes of those referenced entities.

For this file we again use the CSV file format, with three values. The first is the URI of the referenced entity, the second its label, and the third, its class alignment.

Example: Song/referencedEntities.csv

"http://dbpedia.org/resource/Shelley_Laine-8ee3f06c-a68a-495b-a788-5a4473e39384","Shelley Laine","MusicalArtist"
"http://dbpedia.org/resource/Skipping_Stones-b3ff0194-6a65-4835-8635-f02ba6d58e3d","Skipping Stones","Album"
"http://dbpedia.org/resource/Best_Of_1991-2001-ab96f5ab-730a-48a2-a0b9-e3275993bf07","Best Of 1991-2001","Album"
"http://dbpedia.org/resource/Terry_Steele-67a5c337-4b29-497e-9852-28ae222c7bfd","Terry Steele","Writer"
"http://dbpedia.org/resource/David_L._Elliott-b11a4b31-8e01-4c06-8847-af359577525a","David L. Elliott","Writer"
"http://dbpedia.org/resource/Anthology:_1965-1972-013145ea-382f-448e-ae7f-dfd3d959a2e0","Anthology: 1965-1972","Album"
"http://dbpedia.org/resource/Bangarang_(EP)-b3bf5bc8-454a-47b6-9b4a-8b8b1b53f728","Bangarang_(EP)","Album"
"http://dbpedia.org/resource/Old_School_New Rules-34473575-5c29-4048-8435-f96717404db7","Old School New Rules","Album"
"http://dbpedia.org/resource/Wanna-d7a7259b-da78-41cd-bcac-bcd51ff040f2","Wanna","MusicalWork"
...
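
For completeness, a reader sketch (our own) for this file, again assuming the general CSV conventions of section 7.

import csv


def read_referenced_entities(path="Song/referencedEntities.csv"):
    """Map referenced-entity URI -> (label, class) for long-tail objects of Song facts."""
    referenced = {}
    with open(path, newline="", encoding="utf-8") as f:
        for uri, label, cls in csv.reader(f):
            referenced[uri] = (label, cls)
    return referenced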

8. Download

You can download the dataset here: T4LTE.zip

9. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

10. References

  1. [Cafarella2008] Cafarella, Michael J. and Halevy, Alon Y. and Zhang, Yang and Wang, Daisy Zhe and Wu, Eugene (2008), "Uncovering the Relational Web", In WebDB '08.
  2. [Dong2014] Dong, Xin and Gabrilovich, Evgeniy and Heitz, Geremy and Horn, Wilko and Lao, Ni and Murphy, Kevin and Strohmann, Thomas and Sun, Shaohua and Zhang, Wei (2014), "Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion", In KDD '14.
  3. [Lehmann2015] Lehmann, Jens and Isele, Robert and Jakob, Max and Jentzsch, Anja and Kontokostas, Dimitris and Mendes, Pablo N and Hellmann, Sebastian and Morsey, Mohamed and Van Kleef, Patrick and Auer, Sören and others (2015), "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia", Semantic Web. Vol. 6(2), pp. 167-195. IOS Press.
  4. [Oulabi2016] Oulabi, Yaser and Meusel, Robert and Bizer, Christian (2016), "Fusing Time-dependent Web Table Data", In WebDB '16.
  5. [Oulabi2017] Oulabi, Yaser and Bizer, Christian (2017), "Estimating missing temporal meta-information using Knowledge-Based-Trust", In KDWeb '17.
  6. [Oulabi2019] Oulabi, Yaser and Bizer, Christian (2019), "Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data", In EDBT '19.
  7. [Ritze2015] Ritze, Dominique and Lehmberg, Oliver and Bizer, Christian (2015), "Matching HTML Tables to DBpedia", In WIMS '15.
  8. [Ritze2016] Ritze, Dominique and Lehmberg, Oliver and Oulabi, Yaser and Bizer, Christian (2016), "Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases", In WWW '16.

Released: 15.07.19