Record matching models typically output a real-valued matching score that is later consumed through thresholding, ranking, or human review. While fairness in record matching has mostly been assessed using binary decisions at a fixed threshold, such evaluations can miss systematic disparities in the entire score distribution and can yield conclusions that change with the chosen threshold. We introduce a threshold-independent notion of score bias that extends standard group-fairness criteria, namely demographic parity (DP), equal opportunity (EO), and equalized odds (EOD), from binary outputs to score functions by integrating group-wise metric gaps over all thresholds. Using this metric, we empirically show that several state-of-the-art deep matchers can exhibit substantial score bias even when appearing fair at commonly used thresholds. To mitigate these disparities without retraining the underlying matcher, we propose two model-agnostic post-processing methods that only require score evaluations on an (unlabeled) calibration set. Calib targets DP by aligning minority/majority score distributions to a common Wasserstein barycenter via a quantile-based optimal-transport map, with finite-sample guarantees on both residual DP bias and score distortion. C-Calib extends this idea to label-dependent notions (EO/EOD) by performing barycenter alignment conditionally on an estimated label, and we characterize how its guarantees depend on both sample size and label-estimation error. Experiments on standard record-matching benchmarks and multiple neural matchers confirm that Calib and C-Calib substantially reduce score bias with minimal loss in accuracy.
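To make the two central ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes scores lie in [0, 1], that DP score bias is the integral over thresholds of the absolute gap in positive rates between two groups, and that the barycenter weights each group by its share of the calibration set. The names `dp_score_bias`, `fit_barycenter_map`, and the synthetic Beta-distributed scores are all hypothetical and chosen only for illustration.

```python
import numpy as np


def dp_score_bias(scores_a, scores_b, n_thresholds=1001):
    """Approximate the integral over t in [0, 1] of
    |P(score > t | group a) - P(score > t | group b)| (assumed definition)."""
    t = np.linspace(0.0, 1.0, n_thresholds)
    sorted_a = np.sort(np.asarray(scores_a, dtype=float))
    sorted_b = np.sort(np.asarray(scores_b, dtype=float))
    # Fraction of each group scoring strictly above each threshold.
    rate_a = 1.0 - np.searchsorted(sorted_a, t, side="right") / len(sorted_a)
    rate_b = 1.0 - np.searchsorted(sorted_b, t, side="right") / len(sorted_b)
    gap = np.abs(rate_a - rate_b)
    # Trapezoidal rule on the uniform threshold grid.
    return float(np.sum((gap[1:] + gap[:-1]) / 2.0 * np.diff(t)))


def fit_barycenter_map(calib_a, calib_b):
    """Fit a quantile-based map from each group's scores to a weighted
    1-D Wasserstein barycenter, using unlabeled calibration scores only."""
    sorted_a = np.sort(np.asarray(calib_a, dtype=float))
    sorted_b = np.sort(np.asarray(calib_b, dtype=float))
    n_a, n_b = len(sorted_a), len(sorted_b)
    # Assumption: barycenter weights equal the group proportions.
    w_a = n_a / (n_a + n_b)
    w_b = 1.0 - w_a

    def transform(scores, group):
        own = sorted_a if group == "a" else sorted_b
        # Empirical CDF value of each score within its own group.
        u = np.searchsorted(own, np.asarray(scores, dtype=float), side="right") / len(own)
        # In 1-D, the barycenter quantile function is the weighted
        # average of the group quantile functions.
        return w_a * np.quantile(sorted_a, u) + w_b * np.quantile(sorted_b, u)

    return transform


# Synthetic illustration: two groups with clearly different score distributions.
rng = np.random.default_rng(0)
calib_a, calib_b = rng.beta(2, 5, 2000), rng.beta(5, 2, 2000)
test_a, test_b = rng.beta(2, 5, 2000), rng.beta(5, 2, 2000)

transform = fit_barycenter_map(calib_a, calib_b)
before = dp_score_bias(test_a, test_b)
after = dp_score_bias(transform(test_a, "a"), transform(test_b, "b"))
print(f"DP score bias before: {before:.3f}, after: {after:.3f}")
```

Under these assumptions the integrated gap coincides with the 1-Wasserstein distance between the two groups' score distributions, which is why pushing both groups onto a shared barycenter drives it toward zero. A label-conditional variant in the spirit of C-Calib would fit one such map per estimated label, which this sketch does not attempt.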