In many application settings, data have missing entries, which makes analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and test data. We show the consistency of two approaches to prediction. A striking result is that the widely used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is criticized for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we further analyze decision trees. These can naturally tackle empirical risk minimization with missing values, thanks to their ability to handle the half-discrete nature of incomplete variables. After comparing different missing-values strategies in trees, both theoretically and empirically, we recommend the "missing incorporated in attribute" method, as it can handle both non-informative and informative missing values.
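To make two of the strategies named above concrete, here is a minimal sketch, not the authors' implementation. It assumes missing entries are encoded as NaN in a NumPy array. The function names `mean_impute` and `mia_encode` and the `big` parameter are hypothetical, introduced only for illustration. The second function follows one commonly described way of emulating "missing incorporated in attribute": each column is duplicated, with missing entries sent to a very small value in one copy and a very large value in the other, so that a tree split on either copy can route the missing values to the left or to the right.

```python
import numpy as np


def mean_impute(X_train, X_test):
    """Fill NaNs with per-column means computed on the training set only.

    Assumes every column has at least one observed training value.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    means = np.nanmean(X_train, axis=0)  # column means, ignoring NaNs

    def fill(X):
        # Broadcast the training means into the missing positions.
        return np.where(np.isnan(X), means, X)

    return fill(X_train), fill(X_test)


def mia_encode(X, big=1e10):
    """Duplicate each column: NaN -> -big in one copy, NaN -> +big in the other.

    `big` is an illustrative stand-in for infinity; it should exceed the
    magnitude of every observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    low = np.where(missing, -big, X)   # missing values fall left of any threshold
    high = np.where(missing, big, X)   # missing values fall right of any threshold
    return np.hstack([low, high])
```

Either output can then be passed to any learner that expects complete numeric input. Computing the means on the training set and reusing them at test time keeps the imputation function identical across the two samples, which matches the setting of missing values in both training and test data.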