In many application settings, data have missing entries, which makes analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and test data. We show the consistency of two approaches to prediction. A striking result is that the widely used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is criticized for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we further analyze decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing different missing-values strategies in trees, both theoretically and empirically, we recommend the "missing incorporated in attribute" method, as it can handle both non-informative and informative missing values.
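As a minimal sketch of two strategies named above, constant (mean) imputation and "missing incorporated in attribute" (MIA), the following Python snippet encodes a single incomplete feature. The function names `mean_impute` and `mia_encode` and the toy data are illustrative assumptions. Duplicating the column with missing entries set to +inf and -inf is one common way to realize MIA with an ordinary threshold-splitting tree, not necessarily the implementation used in this work.

```python
import math

def mean_impute(column):
    """Replace NaN entries with the mean of the observed entries."""
    observed = [x for x in column if not math.isnan(x)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(x) else x for x in column]

def mia_encode(column):
    """Return two copies of the column for an MIA-style encoding.

    In the first copy NaN becomes +inf, so any split `x <= t` sends
    missing values to the right child; in the second copy NaN becomes
    -inf, so the same split sends them to the left child.
    """
    to_right = [math.inf if math.isnan(x) else x for x in column]
    to_left = [-math.inf if math.isnan(x) else x for x in column]
    return to_right, to_left

# Toy feature with two missing entries (hypothetical data).
x = [1.0, float("nan"), 3.0, float("nan"), 5.0]

imputed = mean_impute(x)
to_right, to_left = mia_encode(x)

# A threshold split at t = 2.0 on each encoded copy.
t = 2.0
left_child_a = [v <= t for v in to_right]  # missing values go right
left_child_b = [v <= t for v in to_left]   # missing values go left
```

With this encoding, a tree that searches over thresholds on both copies can send missing values to either side of any split, and a threshold at the largest observed value of the first copy separates missing from observed entries. This is how the encoding lets a tree exploit informative missingness.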