Entity matching (EM) is a challenging problem that different communities have studied for over half a century. Algorithmic fairness has also become a timely topic for addressing machine bias and its societal impacts. Despite extensive research on both topics, little attention has been paid to the fairness of entity matching. To address this gap, we perform an extensive experimental evaluation of a variety of EM techniques in this paper. We construct two social datasets from publicly available data for the purpose of auditing EM through the lens of fairness. Our findings underscore potential unfairness under two conditions common in real-world societies: (i) when some demographic groups are overrepresented, and (ii) when names are more similar in some groups than in others. Among our findings, we note that while various fairness definitions are valuable in different settings, because EM is inherently class-imbalanced, measures such as positive predictive value parity and true positive rate parity are, in general, more capable of revealing EM unfairness.
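To make the last point concrete, the following is a minimal sketch of how per-group positive predictive value (PPV) and true positive rate (TPR) could be compared with accuracy on a class-imbalanced matching task. The function names (`group_rates`, `make_records`), the group labels `A` and `B`, and all counts are hypothetical and invented purely for illustration; they are not taken from the paper's datasets or results.

```python
from collections import defaultdict


def group_rates(records):
    """Compute accuracy, PPV, and TPR per group.

    records: iterable of (group, y_true, y_pred), where y_true and y_pred
    are 0/1 labels for a candidate record pair (1 = match).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, y_true, y_pred in records:
        if y_true and y_pred:
            key = "tp"
        elif y_pred:
            key = "fp"
        elif y_true:
            key = "fn"
        else:
            key = "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        total = sum(c.values())
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        rates[group] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            "ppv": c["tp"] / predicted_pos if predicted_pos else float("nan"),
            "tpr": c["tp"] / actual_pos if actual_pos else float("nan"),
        }
    return rates


def make_records(group, tp, fp, fn, tn):
    """Expand confusion-matrix counts into (group, y_true, y_pred) records."""
    return (
        [(group, 1, 1)] * tp
        + [(group, 0, 1)] * fp
        + [(group, 1, 0)] * fn
        + [(group, 0, 0)] * tn
    )


# Hypothetical toy counts: 1000 candidate pairs per group, only 10 true matches each.
records = make_records("A", tp=9, fp=1, fn=1, tn=989) + make_records(
    "B", tp=5, fp=5, fn=5, tn=985
)
rates = group_rates(records)

for metric in ("accuracy", "ppv", "tpr"):
    gap = abs(rates["A"][metric] - rates["B"][metric])
    print(f"{metric}: A={rates['A'][metric]:.3f} B={rates['B'][metric]:.3f} gap={gap:.3f}")
```

In this toy setting the two groups differ by less than one percentage point in accuracy, because true non-matches dominate both groups, while the PPV and TPR gaps are 0.4. This is one way to see why parity measures that ignore true negatives can expose disparities that accuracy-based measures hide.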