Alexander Brinkmann
Roee Shraga
Christian Bizer

This page offers the WDC Block benchmark for download. WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines. WDC Block features a maximum Cartesian product of 200 billion pairs as well as training sets of different sizes for evaluating supervised blockers. We use WDC Block to evaluate several state-of-the-art blocking systems, including CTT, Auto, JedAI, Sudowoodo, SBERT, BM25 and SC-Block.

1 Introduction

Entity resolution aims to identify records in two datasets (A and B) that describe the same real-world entity [2,3,4]. Since comparing all record pairs between two datasets can be computationally expensive, entity resolution is approached in two steps: blocking and matching. Blocking applies a computationally cheap method to remove non-matching record pairs and produces a smaller set of candidate record pairs, reducing the workload of the matcher. During matching, a more expensive pair-wise matcher produces the final set of matching record pairs [2,8]. Existing benchmark datasets for blocking and matching are rather small with respect to the Cartesian product A×B for comparing all records and with respect to the vocabulary size [7]. If blockers are evaluated only on these small datasets, effects resulting from a high number of records or from a large vocabulary (a large number of unique tokens that need to be indexed) may be missed. The Web Data Commons Block (WDC Block) benchmark provides much larger datasets and thus requires blockers that address these scalability challenges. Additionally, we provide three development sets of different sizes (~1K pairs, ~5K pairs, and ~20K pairs) for experimenting with different amounts of training data for the blockers.
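The idea of cheap candidate generation can be illustrated with a minimal token-blocking sketch. This is an illustrative toy, not one of the benchmarked methods: records that share at least one title token become candidate pairs, which avoids materializing the full Cartesian product A×B.

```python
# Minimal token-blocking sketch (illustrative only): records sharing at least
# one token become candidate pairs instead of comparing all of A x B.
from collections import defaultdict

table_a = {0: "apple iphone 12 case", 1: "garden hose 10m"}
table_b = {0: "iphone 12 silicone case", 1: "usb c cable 2m"}

index = defaultdict(set)              # token -> record ids in table B
for rid, text in table_b.items():
    for token in text.split():
        index[token].add(rid)

candidates = set()
for rid_a, text in table_a.items():
    for token in text.split():
        for rid_b in index[token]:
            candidates.add((rid_a, rid_b))

print(sorted(candidates))             # far fewer pairs than len(A) * len(B)
```

Only one of the four possible pairs survives blocking here; the matcher then has to score just this candidate pair.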

WDC Block is based on product data that was extracted in 2020 from 3,259 e-shops that mark up product offers within their HTML pages using the schema.org vocabulary. The largest variant of WDC Block uses offers for 2 million different products. Multiple offers referring to the same product are identified based on GTIN and MPN numbers provided by the e-shops.

2 Benchmark Creation

The WDC Block benchmark was created in four steps: (i) we select a difficult variant of the WDC Products entity matching benchmark as the seed dataset for WDC Block, (ii) we split the dataset into two separate datasets A and B, (iii) we enlarge the datasets by adding offers for additional non-matching products from the WDC Product Data Corpus V2020, and (iv) we prepare three development sets (~1K pairs, ~5K pairs, and ~20K pairs). This section gives an overview of the four steps.

  1. We choose the large, pairwise training set with 80% corner cases and 20% random pairs, together with the 50%-unseen test set, from the WDC Products entity matching benchmark as the seed dataset for our benchmark. We selected the large version to start with a large number of product offers that are difficult to match, and the 50% seen / 50% unseen test set as a trade-off between product offers that are part of the training set and product offers that the blockers did not see during training.
  2. We split the original record pairs into two datasets A and B to follow the common setup of entity matching datasets in the related work. Both datasets A and B are deduplicated to obtain clean datasets.
  3. Depending on the specific benchmark (small, medium, large), we populate the datasets A and B with additional randomly selected offers from the WDC Product Data Corpus V2020. We make sure that the randomly selected records do not match any of the existing records in the datasets to avoid introducing additional matching pairs.
  4. We derive the train, validation, and test splits and transfer them to the format of the entity matching datasets in the related work. For the large development set (~20K pairs), the initially derived pairs are retained. For the medium development set (~5K pairs) and the small development set (~1K pairs), the train and validation sets are down-sampled such that their pair distributions remain comparable to those of the large development set. The test set remains the same for all versions of the benchmark to keep the results comparable.
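The down-sampling in step 4 can be sketched with pandas: sampling each label group with the same fraction keeps the positive/negative ratio of the smaller splits close to that of the large development set. The toy data and sampling fraction below are illustrative, not the benchmark's actual procedure.

```python
# Hypothetical sketch of down-sampling a train split while preserving the
# positive/negative label ratio (stratified sampling by label).
import pandas as pd

train = pd.DataFrame({
    "id_a":  range(10),
    "id_b":  range(10),
    "label": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],   # 30% positives
})

frac = 0.5  # illustrative target size relative to the large development set
small = train.groupby("label").sample(frac=frac, random_state=42)

print(small["label"].mean())  # ratio stays close to the original 30%
```

Sampling per label group rather than over the whole frame is what keeps the distributions of the small and medium development sets comparable to the large one.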

3 Benchmark Profiling

The WDC Block benchmark covers a total of 2,073,224 real-world products across all subsets, which are described by 2,100,000 product offers. Each product offer in WDC Block has five attributes: title, description, price, priceCurrency, and brand. WDC Block comes with nine configurations, which are derived from the two dimensions dataset size (small, medium, large), described in Table 1, and development set size (~1K, ~5K, ~20K pairs), described in Table 2. Table 1 shows the number of records in Table A and Table B, the number of positive and negative pairs in the test set, as well as the vocabulary size and the Cartesian product of the different dataset sizes. The vocabulary size is the number of unique tokens after concatenating the attribute values of the product offers in Table A and Table B and tokenizing the concatenated values by whitespace. The Cartesian product is the maximum number of record comparisons between Table A and Table B (A×B).

Table 1: Benchmark Statistics.
Dataset | Table A | Table B | Pos. Pairs Test | Neg. Pairs Test | Vocabulary Size | Cartesian Product
WDC Block small | 5,000 | 5,000 | 500 | 4,000 | 67,294 | 25M
WDC Block medium | 5,000 | 200,000 | 500 | 4,000 | 1,174,280 | 1,000M
WDC Block large | 100,000 | 2,000,000 | 500 | 4,000 | 6,880,107 | 200,000M
Table 2: Development Set Statistics.
Development Set | Size | Pos. Pairs Train | Neg. Pairs Train | Pos. Pairs Val. | Neg. Pairs Val. | Total Pairs
Small | ~1K | 266 | 408 | 133 | 203 | 1,011
Medium | ~5K | 1,559 | 1,880 | 779 | 939 | 5,192
Large | ~20K | 6,454 | 10,502 | 3,226 | 5,250 | 21,932
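The vocabulary-size statistic described above can be reproduced with a short pandas sketch: concatenate the attribute values of each offer in tables A and B, tokenize by whitespace, and count the unique tokens. The column names and toy values below are illustrative.

```python
# Sketch of the vocabulary-size computation: unique whitespace tokens over the
# concatenated attribute values of both tables. Columns are illustrative.
import pandas as pd

table_a = pd.DataFrame({"title": ["iphone 12 case", "garden hose"],
                        "brand": ["apple", "gardena"]})
table_b = pd.DataFrame({"title": ["iphone 12 silicone case"],
                        "brand": ["apple"]})

vocabulary = set()
for table in (table_a, table_b):
    concatenated = table.astype(str).agg(" ".join, axis=1)  # one string per offer
    for row in concatenated:
        vocabulary.update(row.split())

print(len(vocabulary))  # number of unique tokens across both tables
```

On the real benchmark tables this count grows to 6,880,107 tokens for WDC Block large, which is what makes indexing expensive for the blockers.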

4 Experiments

To demonstrate the usefulness of WDC Block, we evaluate the blocking methods BM25, BM25 with trigrams (BM25-3), CTT [4], Auto [4], JedAI [3], SBERT [5], Barlow Twins [6], SimCLR [6], and SC-Block [1] on WDC Block. Detailed explanations of the experiments can be found in the corresponding paper [1].

4.1 Blocking Systems

4.2 Benchmark Results for nearest neighbour blockers with k = 5

We first analyze all nearest neighbour blockers with a fixed number of nearest neighbours k = 5. We use recall and precision to evaluate the candidate sets with respect to the test sets of the datasets. By fixing the hyperparameter k, differences in recall and precision become visible that would be hidden if k were tuned. The recall and precision results in Table 3 show that k = 5 is not sufficient for the benchmarked blockers to achieve a recall above 75%. The results also show that both increasing the dataset size and decreasing the development set size increase the difficulty of the benchmark.
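The candidate-set evaluation used here can be stated in a few lines: recall is the fraction of gold matching pairs that survive blocking, and precision is the fraction of candidate pairs that are true matches. The pair ids below are illustrative.

```python
# Sketch of candidate-set evaluation against a gold standard of matching pairs.
gold_matches = {(1, 7), (2, 9), (3, 4)}
candidate_set = {(1, 7), (2, 9), (5, 6), (8, 8)}

true_positives = gold_matches & candidate_set
recall = len(true_positives) / len(gold_matches)        # 2 of 3 matches kept
precision = len(true_positives) / len(candidate_set)    # 2 of 4 candidates match

print(recall, precision)
```

A blocker with low recall here is unrecoverable: matching pairs dropped during blocking can never be restored by the matcher.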

Table 3: Recall (R) and Precision (P) of the candidate sets generated by all nearest neighbour blockers with k = 5 on the test sets of the datasets. The highest recall and precision values are marked in bold. 'timeout' indicates a timeout after 48h and 'OOM' indicates an out-of-memory error.
Blocker | R (small) | P (small) | R (medium) | P (medium) | R (large) | P (large)
BM25 | 59.42% | 39.73% | 53.81% | 45.28% | 41.70% | 54.39%
BM25-3 | 52.47% | 38.87% | 46.64% | 43.88% | timeout | timeout
SC-Block (Large Dev. Set) | 71.52% | 57.27% | 66.37% | 63.52% | 56.73% | 74.41%
SC-Block (Medium Dev. Set) | 58.52% | 51.58% | 50.67% | 58.25% | 37.00% | 62.74%
SC-Block (Small Dev. Set) | 42.15% | 46.88% | 26.46% | 46.83% | 17.49% | 47.56%
Barlow Twins | 31.61% | 34.73% | 21.30% | 35.58% | 12.56% | 33.14%
SimCLR | 34.75% | 39.24% | 21.08% | 38.52% | 2.69% | 33.33%
SBERT (Large Dev. Set) | 45.29% | 48.91% | 34.98% | 55.12% | 24.44% | 56.77%
Auto | 43.20% | 39.48% | 35.80% | 40.68% | OOM | OOM
CTT | 42.60% | 38.10% | 34.80% | 40.56% | OOM | OOM

4.3 Benchmark Results for nearest neighbour blockers with recall >99.5% on validation set

We analyze how tuning the hyperparameter k affects the recall of the nearest neighbour blockers on WDC Block. Increasing k raises the probability of finding a matching pair, resulting in a higher recall. However, higher values of k also produce larger candidate sets, giving the matcher more candidate pairs to compare and thus prolonging the matching phase of the entity resolution pipeline. Our main goal in tuning k is to add all matching pairs to the candidate set while keeping the candidate set as small as possible. To achieve this goal, we evaluate each nearest neighbour blocker on the validation set with increasing values of k, starting with k = 1. Once the recall of the candidate set exceeds 99.5% on the validation set, the search stops. To limit the search space, we set a maximum value of k = 50 on WDC Block small, k = 100 on WDC Block medium, and k = 200 on WDC Block large.
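The tuning procedure above can be sketched as a simple search loop. The function `neighbours(k)` below is a hypothetical stand-in for any nearest neighbour blocker that returns the candidate set for a given k; the toy blocker and gold pairs are illustrative.

```python
# Sketch of the k-tuning procedure: increase k until the candidate set reaches
# the target recall on the validation set, capped at a dataset-specific maximum.
def tune_k(neighbours, gold_matches, k_max, target_recall=0.995):
    for k in range(1, k_max + 1):
        candidate_set = neighbours(k)
        recall = len(gold_matches & candidate_set) / len(gold_matches)
        if recall >= target_recall:
            return k, recall
    return k_max, recall  # threshold never reached within the cap

# Toy blocker: the k nearest neighbours of record 0 are records 1..k.
gold = {(0, 2), (0, 3)}
k, recall = tune_k(lambda k: {(0, j) for j in range(1, k + 1)}, gold, k_max=50)
print(k, recall)  # stops at k = 3 with 100% recall
```

Because the loop stops at the first k that clears the threshold, it also yields the smallest candidate set that satisfies the recall goal for that blocker.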

The results in Table 4 show how challenging WDC Block is for the blockers: except for SC-Block trained on the large development set, all blockers fail to generate candidate sets that exceed the 99.5% recall threshold on the validation set. WDC Block large is the most challenging benchmark dataset due to its high number of records and its large vocabulary of unique tokens. The blockers BM25-3, JedAI, Auto, and CTT fail to generate candidate sets on WDC Block large due to time and memory constraints.

Table 4: k per blocker and dataset after tuning k for a recall of 99.5% on the respective validation set. Recall (R) and candidate set size (|C|) of all blockers on the test set. The highest recall, as well as the lowest k and |C| values per dataset, are marked in bold. 'timeout' indicates a timeout after 48h and 'OOM' indicates an out-of-memory error.
Blocker | k (small) | R (small) | |C| (small) | k (medium) | R (medium) | |C| (medium) | k (large) | R (large) | |C| (large)
BM25 | 50 | 96.86% | 250k | 100 | 97.76% | 500k | 50 | 83.18% | 20M
BM25-3 | 50 | 94.17% | 250k | 100 | 93.95% | 500k | timeout | timeout | timeout
SC-Block (Large Dev. Set) | 14 | 93.50% | 70k | 20 | 91.93% | 100k | 50 | 89.46% | 5M
SC-Block (Medium Dev. Set) | 50 | 92.60% | 250k | 100 | 86.55% | 500k | 50 | 77.80% | 500k
SC-Block (Small Dev. Set) | 50 | 71.75% | 250k | 100 | 52.02% | 500k | 50 | 42.20% | 500k
Barlow Twins | 50 | 66.59% | 250k | 100 | 42.60% | 500k | 50 | 27.80% | 500k
SimCLR | 50 | 69.51% | 250k | 100 | 45.96% | 500k | 50 | 36.10% | 500k
JedAI | - | 55.40% | 51k | - | 80.60% | 561k | timeout | timeout | timeout
SBERT (Large Dev. Set) | 50 | 78.48% | 250k | 100 | 63.39% | 500k | 50 | 58.74% | 500k
Auto | 50 | 85.20% | 250k | 100 | 80% | 500k | OOM | OOM | OOM
CTT | 50 | 83% | 250k | 100 | 78% | 500k | OOM | OOM | OOM

5 Downloads

We offer the WDC Block benchmark for public download. The benchmark is available as a single zip file for each configuration. Each zip file contains the two tables A and B as well as the train, validation, and test splits.

The code used for the creation of the benchmark can be found on GitHub.

The files are represented in CSV format and can, for example, easily be processed using the pandas Python library:

import pandas as pd

# 'file_name.csv' is a placeholder for any of the benchmark files
df = pd.read_csv('file_name.csv')
Dataset Size | Small development set (~1K) | Size | Medium development set (~5K) | Size | Large development set (~20K) | Size
Small | Size S, Train S | 2.2MB | Size S, Train M | 2.2MB | Size S, Train L | 2.5MB
Medium | DS M, Train S | 38MB | DS M, Train M | 38MB | DS M, Train L | 39MB
Large | DS L, Train S | 385MB | DS L, Train M | 385MB | DS L, Train L | 385MB

6 Feedback

Please send questions and feedback to the Web Data Commons Google Group.

More information about Web Data Commons can be found here.

7 References


[1] A. Brinkmann, R. Shraga, and C. Bizer, 2023, "SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines", arXiv:2303.03132 [cs].
[2] P. Christen, 2012, "Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection", Springer, Berlin, Heidelberg.
[3] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis, 2021, "An Overview of End-to-End Entity Resolution for Big Data", in ACM Computing Surveys, vol. 53, no. 6, pp. 1–42.
[4] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan, 2021, "Deep Learning for Blocking in Entity Matching: A Design Space Exploration", in Proceedings of the VLDB Endowment, vol. 14, no. 1, pp. 2459–2472.
[5] N. Reimers and I. Gurevych, 2019, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", in Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992.
[6] R. Wang, Y. Li, and J. Wang, 2022, "Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation", arXiv preprint.
[7] H. Köpcke, A. Thor, and E. Rahm, 2010, "Evaluation of Entity Resolution Approaches on Real-world Match Problems", in Proceedings of the VLDB Endowment, vol. 3, no. 1, pp. 484–493.
[8] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas, 2021, "Blocking and Filtering Techniques for Entity Resolution: A Survey", in ACM Computing Surveys, vol. 53, no. 2, pp. 1–42.
[9] V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, and T. Palpanas, 2017, "Parallel Meta-blocking for Scaling Entity Resolution over Big Heterogeneous Data", in Information Systems, vol. 53, pp. 137–157.
[10] D. Paulsen, Y. Govind, and A. Doan, 2023, "Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching", in Proceedings of the VLDB Endowment, vol. 16, pp. 1507–1519.