Conformal prediction is a theoretically grounded framework for constructing predictive intervals. We study conformal prediction with missing values in the covariates, a setting that brings new challenges to uncertainty quantification. We first show that the marginal coverage guarantee of conformal prediction holds on imputed data for any missingness distribution and almost all imputation functions. However, we emphasize that the average coverage varies depending on the pattern of missing values: conformal methods tend to construct prediction intervals that under-cover the response conditionally on some missing patterns. This motivates our novel generalized conformalized quantile regression framework, missing data augmentation, which yields prediction intervals that are valid conditionally on the patterns of missing values, despite their exponential number. We then show that a universally consistent quantile regression algorithm trained on the imputed data is Bayes optimal for the pinball risk, thus achieving valid coverage conditionally on any given data point. Moreover, we examine the case of a linear model, which demonstrates the importance of our proposal in overcoming the heteroskedasticity induced by missing values. Using synthetic data and data from critical care, we corroborate our theory and report improved performance of our methods.
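To make the first two claims concrete, the following is a minimal sketch of split conformal prediction applied after imputation. It is not the generalized conformalized quantile regression or missing data augmentation procedure described above: it swaps in absolute-residual scores around a least-squares fit, and the simulation settings (Gaussian covariates, coefficients `beta`, values missing completely at random with probability 0.2, mean imputation) and the function name `split_conformal_with_imputation` are all assumptions chosen for illustration.

```python
import numpy as np


def split_conformal_with_imputation(alpha=0.1, seed=0):
    """Split conformal intervals on mean-imputed data (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_train, n_cal, n_test, d = 1000, 1000, 2000, 3
    n = n_train + n_cal + n_test

    # Hypothetical linear model with homoskedastic noise (an assumption).
    beta = np.array([1.0, 2.0, -1.0])
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(size=n)

    # Each covariate entry is missing completely at random with prob. 0.2.
    mask = rng.random((n, d)) < 0.2
    X_obs = np.where(mask, np.nan, X)

    tr = slice(0, n_train)
    ca = slice(n_train, n_train + n_cal)
    te = slice(n_train + n_cal, n)

    # Mean imputation, with the means learned on the training split only.
    col_means = np.nanmean(X_obs[tr], axis=0)
    X_imp = np.where(mask, col_means, X_obs)

    def design(A):
        # Add an intercept column.
        return np.column_stack([np.ones(len(A)), A])

    # Least-squares fit on the imputed training data.
    coef, *_ = np.linalg.lstsq(design(X_imp[tr]), y[tr], rcond=None)

    def predict(A):
        return design(A) @ coef

    # Conformity scores on the calibration split: absolute residuals.
    scores = np.abs(y[ca] - predict(X_imp[ca]))

    # Finite-sample corrected quantile: the k-th smallest score,
    # with k = ceil((n_cal + 1) * (1 - alpha)).
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    q = np.sort(scores)[k - 1] if k <= n_cal else np.inf

    # The interval is [prediction - q, prediction + q] for every test point.
    covered = np.abs(y[te] - predict(X_imp[te])) <= q
    n_missing = mask[te].sum(axis=1)

    return {
        "half_width": float(q),
        "marginal": float(covered.mean()),
        "no_missing": float(covered[n_missing == 0].mean()),
        "two_or_more_missing": float(covered[n_missing >= 2].mean()),
    }


if __name__ == "__main__":
    for name, value in split_conformal_with_imputation().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, one would expect the marginal coverage to sit near the nominal level 1 - alpha, while coverage computed only on test points with several missing covariates falls below it and coverage on fully observed points rises above it. The interval width is the same for every point, yet the residual variance grows with the number of imputed entries, which is the heteroskedasticity induced by missing values.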