Conditions that ensure optimal parameter estimation in the presence of missing data are well established in inference and typically rely on the Missing-at-Random (MAR) assumption. Similar principles are often assumed to apply in prediction. However, methods considered biased in inference, such as pattern sub-modelling or unconditional imputation, have been shown to achieve optimal predictive performance under any missingness mechanism, including Missing-Not-at-Random (MNAR). To explain this apparent contradiction, we introduce a new formal framework for describing missingness in prediction. Central to this framework is a distinction between two prediction targets, defined according to whether or not the observation indicator of the predictors (that is, which predictors are observed) is used to predict the outcome. This distinction leads to a classification of missingness mechanisms that describes when the two targets coincide and when each can be predicted consistently. A key result is that both targets may be consistently predicted under conditions weaker than MAR. We discuss the implications of this paradigm for handling missing data in prediction, distinguishing between missingness at development, validation and deployment of a forecaster. The findings are illustrated using simulated data and a real-world application: predicting significant injury after trauma upon arrival at the emergency department.
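The distinction between the two prediction targets can be made concrete with a small amount of notation. The following is only a sketch under assumed notation, not the paper's own formal definitions: the symbols $Y$ (outcome), $X$ (predictors), $M$ (observation indicator), $o(m)$ (indices observed under pattern $m$) and the names $f_1$, $f_2$ are introduced here for illustration.

```latex
% Assumed notation (illustrative, not taken from the paper):
%   Y        outcome
%   X        vector of p predictors
%   M        observation indicator in {0,1}^p, M_j = 1 if X_j is observed
%   o(m)     indices of the predictors observed under pattern m
\begin{align}
  % Target that exploits the observation indicator
  f_1(x_{o(m)}, m) &= \mathbb{E}\left[\, Y \mid X_{o(m)} = x_{o(m)},\; M = m \,\right], \\
  % Target that ignores the observation indicator
  f_2(x_{o(m)})    &= \mathbb{E}\left[\, Y \mid X_{o(m)} = x_{o(m)} \,\right].
\end{align}
% A sufficient condition for the two to coincide for pattern m, under this notation:
%   Y is independent of the event {M = m} given X_{o(m)},
% in which case f_1(x_{o(m)}, m) = f_2(x_{o(m)}).
```

Under this reading, a method such as pattern sub-modelling, which fits one model per missingness pattern, would be aiming at the first target, since the pattern $m$ is part of what it conditions on. The independence condition given in the comments is just one sufficient condition implied by the assumed notation; it is not a restatement of the classification of mechanisms developed in the paper.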