Record matching models typically output a real-valued matching score that is later consumed through thresholding, ranking, or human review. While fairness in record matching has mostly been assessed using binary decisions at a fixed threshold, such evaluations can miss systematic disparities in the entire score distribution and can yield conclusions that change with the chosen threshold. We introduce a threshold-independent notion of score bias that extends the standard group-fairness criteria of demographic parity (DP), equal opportunity (EO), and equalized odds (EOD) from binary outputs to score functions by integrating group-wise metric gaps over all thresholds. Using this measure, we empirically show that several state-of-the-art deep matchers can exhibit substantial score bias even when appearing fair at commonly used thresholds. To mitigate these disparities without retraining the underlying matcher, we propose two model-agnostic post-processing methods that require only score evaluations on an (unlabeled) calibration set. Calib targets DP by aligning minority and majority score distributions to a common Wasserstein barycenter via a quantile-based optimal-transport map, with finite-sample guarantees on both residual DP bias and score distortion. C-Calib extends this idea to label-dependent notions (EO/EOD) by performing barycenter alignment conditionally on an estimated label, and we characterize how its guarantees depend on both sample size and label-estimation error. Experiments on standard record-matching benchmarks and multiple neural matchers confirm that Calib and C-Calib substantially reduce score bias with minimal loss in accuracy.
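To make the quantile-based barycenter alignment concrete, the following is a minimal sketch of the idea behind Calib-style DP calibration, not the paper's actual implementation: in one dimension the Wasserstein barycenter's quantile function is the weighted average of the groups' quantile functions, and each group's scores are transported by sending a score at empirical CDF level u to the barycenter's quantile at u. The function name, quantile-grid size, and weighting scheme are illustrative assumptions.

```python
import numpy as np

def calib_barycenter(scores_by_group, weights=None):
    """Align each group's score distribution to the 1-D Wasserstein
    barycenter via a quantile-based optimal-transport map.

    scores_by_group: dict mapping group id -> 1-D array of scores.
    weights: optional dict of barycenter weights; defaults to
             group sizes (illustrative choice, not from the paper).
    Returns a dict of calibrated score arrays, same keys and order.
    """
    groups = list(scores_by_group)
    if weights is None:
        total = sum(len(scores_by_group[g]) for g in groups)
        weights = {g: len(scores_by_group[g]) / total for g in groups}

    # Shared quantile grid on which the barycenter is represented.
    levels = np.linspace(0.0, 1.0, 201)

    # In 1-D, the barycenter's quantile function is the weighted
    # average of the groups' quantile functions.
    bary_q = sum(weights[g] * np.quantile(scores_by_group[g], levels)
                 for g in groups)

    calibrated = {}
    for g in groups:
        s = np.asarray(scores_by_group[g], dtype=float)
        # Empirical CDF level of each score within its own group.
        u = (np.argsort(np.argsort(s)) + 0.5) / len(s)
        # Optimal-transport map: quantile level u -> barycenter quantile.
        calibrated[g] = np.interp(u, levels, bary_q)
    return calibrated
```

After calibration the groups share (approximately) one score distribution, so the positive-rate gap, and hence the DP gap, is small at every threshold simultaneously; this is what makes the alignment threshold-independent.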