Entity Matching (EM) is crucial for identifying equivalent data entities across different sources, a task that becomes increasingly challenging as data grows in volume and heterogeneity. Blocking techniques reduce the computational complexity of EM and are therefore vital to making the process scalable. Despite advances in blocking methods, the issue of fairness, where blocking may inadvertently favor certain demographic groups, has been largely overlooked. This study extends traditional blocking metrics to incorporate fairness, providing a framework for assessing bias in blocking techniques. Through experimental analysis, we evaluate the effectiveness and fairness of various blocking methods and offer insights into their potential biases. Our findings highlight the importance of considering fairness in EM, particularly in the blocking phase, to ensure equitable outcomes in data integration tasks.