The difficulty of an entity matching task depends on a combination of factors, such as the number of corner-case pairs, the fraction of entities in the test set that were not seen during training, and the size of the development set. Current entity matching benchmarks usually either represent single points in the space spanned by these dimensions or support the evaluation of matching methods along a single dimension only, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark that supports the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are (i) the amount of corner cases, (ii) generalization to unseen entities, and (iii) development set size (training set plus validation set). Generalization to unseen entities is not yet covered by any existing English-language benchmark but is crucial for evaluating the robustness of entity matching systems. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that, for entity matching, contrastive learning is more training-data-efficient than cross-encoders.
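The difference between the pair-wise and the multi-class formulation, and the notion of unseen entities, can be illustrated with a minimal sketch. It is not the benchmark's own code: the toy offers, the entity identifiers, and the helper names `pairwise_examples`, `multiclass_examples`, and `unseen_fraction` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy offers: (offer_id, entity_id, title). Not taken from WDC Products.
offers = [
    ("o1", "e1", "Acme X100 laptop 16GB"),
    ("o2", "e1", "ACME X-100 notebook, 16 GB RAM"),
    ("o3", "e2", "Acme X100 laptop 8GB"),
    ("o4", "e3", "Bolt wireless mouse"),
]


def pairwise_examples(offers):
    """Pair-wise formulation: label every offer pair as match (1) or non-match (0)."""
    return [
        (a[2], b[2], int(a[1] == b[1]))
        for a, b in combinations(offers, 2)
    ]


def multiclass_examples(offers):
    """Multi-class formulation: label every offer with the class index of its entity."""
    classes = sorted({entity for _, entity, _ in offers})
    index = {entity: i for i, entity in enumerate(classes)}
    return [(title, index[entity]) for _, entity, title in offers]


def unseen_fraction(train_entities, test_entities):
    """Fraction of distinct test entities that never occur in the training data."""
    test = set(test_entities)
    return len(test - set(train_entities)) / len(test)


pairs = pairwise_examples(offers)
labels = multiclass_examples(offers)
```

In this toy data, the pair (o1, o3) is a non-match whose titles differ only in memory size, which is the kind of pair the abstract calls a corner case. A call such as `unseen_fraction(["e1", "e2"], ["e2", "e3"])` measures how much of a test set consists of entities absent from training.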