Is this model reliable for everyone? Testing for strong calibration

In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, auditing a model for strong calibration is well known to be difficult, particularly for machine learning (ML) algorithms, because of the sheer number of potential subgroups. As such, common practice is to assess calibration only with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings where the signal is weak or the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, there should be a change in the association between the predicted and observed residuals along this sequence if a poorly calibrated subgroup exists. This lets us reframe the problem of calibration testing as one of changepoint detection, for which powerful methods already exist. We begin by introducing a sample-splitting procedure in which a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
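The sample-splitting idea can be illustrated with a minimal sketch. This is not the authors' implementation, and every name in it (`cusum_stat`, `split_cusum_test`, `p_hat`) is hypothetical. It makes several simplifying assumptions: a single least-squares residual model stands in for the suite of candidate models, the statistic is a one-sided maximum of the cumulative residual sum (so it only looks for subgroups whose risk is under-predicted), and the null distribution is approximated by simulating outcomes from the audited model's own predicted probabilities.

```python
import numpy as np


def cusum_stat(resid, order):
    """Largest cumulative sum of residuals along `order`, scaled by sqrt(n)."""
    return np.max(np.cumsum(resid[order])) / np.sqrt(len(resid))


def split_cusum_test(X, y, p_hat, n_sim=500, train_frac=0.5, seed=0):
    """Sample-splitting CUSUM-style calibration check (illustrative only).

    X: (n, d) features, y: (n,) binary outcomes,
    p_hat: (n,) predicted probabilities from the model being audited.
    Returns the observed statistic and a Monte Carlo p-value.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    n_train = int(train_frac * n)
    train, test = idx[:n_train], idx[n_train:]

    # Step 1 (assumption: one linear model instead of a suite of candidates):
    # learn to predict the residual y - p_hat from the features.
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    beta, *_ = np.linalg.lstsq(A_train, y[train] - p_hat[train], rcond=None)

    # Step 2: order held-out observations by predicted residual, largest first.
    A_test = np.column_stack([np.ones(len(test)), X[test]])
    order = np.argsort(-(A_test @ beta))

    # Step 3: CUSUM of the observed residuals along that ordering.
    observed = cusum_stat(y[test] - p_hat[test], order)

    # Step 4 (assumption: Monte Carlo null): if the model is calibrated,
    # outcomes behave like Bernoulli(p_hat) draws, so simulate the statistic.
    sims = np.empty(n_sim)
    for b in range(n_sim):
        y_null = rng.binomial(1, p_hat[test])
        sims[b] = cusum_stat(y_null - p_hat[test], order)

    p_value = (1 + np.sum(sims >= observed)) / (1 + n_sim)
    return observed, p_value
```

Because the ordering is learned on one split and the statistic is computed on the other, the simulated null statistics use exactly the same ordering as the observed one. A small p-value suggests that some subgroup, found adaptively by the residual model, has a higher event rate than the audited model predicts.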

