Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithm, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d \gg \sqrt{n}$ regime. Experiments illustrate our findings.
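The pipeline described above (zero imputation of MCAR-missing entries, followed by averaged SGD on a linear model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `zero_impute` and `averaged_sgd`, the synthetic Gaussian data, the observation probability `rho`, and the constant step size are all hypothetical choices made for the example.

```python
import numpy as np


def zero_impute(X, mask):
    """Return a copy of X where unobserved entries (mask == False) are set to 0."""
    return X * mask


def averaged_sgd(X, y, step_size):
    """Single-pass, constant-step SGD on the squared loss.

    Returns the running (Polyak-Ruppert) average of the iterates.
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        # Gradient of 0.5 * (x_i^T theta - y_i)^2 with respect to theta.
        grad = (X[i] @ theta - y[i]) * X[i]
        theta = theta - step_size * grad
        # Incremental update of the average of theta_1, ..., theta_{i+1}.
        theta_bar += (theta - theta_bar) / (i + 1)
    return theta_bar


# Synthetic data (assumption): isotropic Gaussian features, linear response.
rng = np.random.default_rng(0)
n, d, rho = 500, 200, 0.7  # rho = probability that an entry is observed
X = rng.standard_normal((n, d)) / np.sqrt(d)
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# MCAR mask: each entry is observed independently with probability rho.
mask = rng.random((n, d)) < rho
X_imp = zero_impute(X, mask)

# Illustrative step size (assumption): small enough for each update to be stable.
step = 1.0 / (2.0 * np.max(np.sum(X_imp ** 2, axis=1)))
theta_hat = averaged_sgd(X_imp, y, step)
```

For intuition on the ridge connection: with independent Bernoulli($\rho$) masks, the second-moment matrix of the zero-imputed features is $\rho^2 \Sigma + \rho(1-\rho)\,\mathrm{diag}(\Sigma)$, so the imputed problem carries an extra diagonal term much like a ridge penalty. The precise statements and constants are those of the paper, not of this sketch.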