In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, auditing a model for strong calibration is well known to be difficult, particularly for machine learning (ML) algorithms, because of the sheer number of potential subgroups. As such, common practice is to assess calibration only with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, then there should be a change in the association between the predicted and observed residuals along this sequence whenever a poorly calibrated subgroup exists. This lets us reframe the problem of calibration testing as one of changepoint detection, for which powerful methods already exist. We begin by introducing a sample-splitting procedure in which a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
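To make the sample-splitting idea concrete, the following is a minimal Python sketch, not the procedure proposed here. It rests on several simplifying assumptions: a single linear least-squares residual model stands in for the suite of candidate models, a plain one-sided CUSUM of raw residuals stands in for the score-based statistic, and the null distribution is approximated by simulating outcomes from the audited model's own predicted probabilities. The names `cusum_stat` and `split_cusum_calibration_test` and all parameters are hypothetical.

```python
import numpy as np


def cusum_stat(resid, var):
    """Largest partial sum of residuals along the given order, standardized."""
    partial = np.cumsum(resid)
    return np.max(partial) / np.sqrt(np.sum(var))


def split_cusum_calibration_test(X, y, p_hat, n_sim=500, train_frac=0.5, seed=0):
    """Illustrative sample-splitting CUSUM test for calibration.

    X     : (n, d) feature matrix
    y     : (n,) binary outcomes
    p_hat : (n,) predicted probabilities from the audited model, in (0, 1)
    Returns the observed statistic and a Monte Carlo p-value.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    tr, te = perm[:n_train], perm[n_train:]

    # Step 1 (assumption: one linear model instead of a suite of candidates):
    # learn to predict the residual y - p_hat on the training split.
    A_tr = np.column_stack([np.ones(len(tr)), X[tr]])
    coef, *_ = np.linalg.lstsq(A_tr, y[tr] - p_hat[tr], rcond=None)

    # Step 2: order held-out observations by predicted residual, largest first,
    # so that an under-predicted subgroup is concentrated at the front.
    A_te = np.column_stack([np.ones(len(te)), X[te]])
    order = np.argsort(-(A_te @ coef))
    p_te = p_hat[te][order]
    y_te = y[te][order]
    var = p_te * (1.0 - p_te)

    observed = cusum_stat(y_te - p_te, var)

    # Step 3 (assumption: simulated null, not an asymptotic one): under strong
    # calibration, held-out outcomes are Bernoulli(p_hat) given the ordering.
    null_stats = np.empty(n_sim)
    for b in range(n_sim):
        y_sim = rng.binomial(1, p_te)
        null_stats[b] = cusum_stat(y_sim - p_te, var)

    p_value = (1 + np.sum(null_stats >= observed)) / (n_sim + 1)
    return observed, p_value
```

Because the ordering is learned only from the training split, the held-out residuals are exchangeable with their simulated counterparts under the null, which is what makes the Monte Carlo p-value valid in this sketch. The statistic is one-sided and therefore only sensitive to subgroups whose risk is under-predicted; negating the residuals would target over-prediction instead.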