Fair calibration is a widely desired fairness criterion in risk prediction contexts. One way to measure and achieve fair calibration is multicalibration, which constrains calibration error among flexibly defined subpopulations while maintaining overall calibration. However, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than among groups with higher base rates. As a result, a decision maker can learn to trust or distrust model predictions for specific groups. To alleviate this, we propose \emph{proportional multicalibration} (PMC), a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its \emph{differential calibration}, a fairness criterion that directly measures how closely a model approximates sufficiency. Proportionally calibrated models therefore limit the ability of decision makers to distinguish between model performance on different patient groups, which may make the models more trustworthy in practice. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to the prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for simultaneously controlling measures of a model's calibration fairness over intersectional groups, with virtually no cost in classification performance.
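To make the contrast between absolute and percent calibration error concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes that the percent calibration error of a (group, prediction bin) cell is the absolute gap between the mean predicted risk and the observed outcome rate, divided by the observed outcome rate. The function name `calibration_errors`, the equal-width binning, and the toy data are all hypothetical.

```python
from collections import defaultdict


def calibration_errors(y_true, risk, groups, n_bins=10):
    """Return {(group, bin): (abs_err, pct_err)} for each non-empty cell.

    Assumption: pct_err = |mean(y) - mean(risk)| / mean(y) within a cell.
    This normalizer is an illustrative choice, not a quoted definition.
    """
    # Each cell accumulates [count, sum of outcomes, sum of predicted risks].
    cells = defaultdict(lambda: [0, 0.0, 0.0])
    for y, r, g in zip(y_true, risk, groups):
        b = min(int(r * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        cell = cells[(g, b)]
        cell[0] += 1
        cell[1] += y
        cell[2] += r

    out = {}
    for key, (n, sum_y, sum_r) in cells.items():
        mean_y = sum_y / n
        mean_r = sum_r / n
        abs_err = abs(mean_y - mean_r)
        # The percent error is undefined when no positives are observed.
        pct_err = abs_err / mean_y if mean_y > 0 else float("inf")
        out[key] = (abs_err, pct_err)
    return out


# Toy data: both groups are over-predicted by the same absolute amount,
# but group "B" has a much lower observed outcome rate than group "A".
y_true = [1] * 10 + [0] * 10 + [1] * 2 + [0] * 18
risk = [0.55] * 20 + [0.15] * 20
groups = ["A"] * 20 + ["B"] * 20

errs = calibration_errors(y_true, risk, groups)
for (group, bin_index), (abs_err, pct_err) in sorted(errs.items()):
    print(f"group={group} bin={bin_index} abs={abs_err:.3f} pct={pct_err:.3f}")
```

In this toy setting both groups have the same absolute calibration error, so a bound on absolute error treats them identically, while the percent error for the low-base-rate group is several times larger. That gap is what a proportional criterion is meant to constrain.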