Constant (naive) imputation is still widely used in practice because it is an easy first technique for dealing with missing data. Yet this simple method could be expected to induce a large bias for prediction purposes, as the imputed input may differ strongly from the true underlying data. However, recent works suggest that this bias is low for high-dimensional linear predictors when data are assumed to be missing completely at random (MCAR). This paper completes the picture for linear predictors by confirming the intuition that the bias is negligible in high dimension and by showing that, surprisingly, naive imputation also remains relevant in very low dimension. To this end, we consider a single underlying random features model, which offers a rigorous framework for studying predictive performance while the dimension of the observed features varies. Building on these theoretical results, we establish finite-sample bounds for stochastic gradient descent (SGD) predictors applied to zero-imputed data, a strategy particularly well suited to large-scale learning. Although the MCAR assumption may appear strong, we show that similarly favorable behavior occurs in more complex missing-data scenarios.
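To make the strategy concrete, the following is a minimal sketch of zero imputation followed by one pass of averaged SGD on a linear model with MCAR entries. It is an illustration under stated assumptions, not the paper's experimental setup: the isotropic Gaussian features, the noise level, the observation probability `rho`, the step-size heuristic, and the helper names `zero_impute` and `averaged_sgd` are all hypothetical choices, and the random features model studied in the paper is not reproduced here.

```python
import numpy as np


def zero_impute(X, mask):
    """Replace missing entries (mask == False) with 0."""
    return np.where(mask, X, 0.0)


def averaged_sgd(X, y, step):
    """One pass of least-squares SGD started at zero; returns the averaged iterate."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        grad = (X[i] @ theta - y[i]) * X[i]  # gradient of 0.5 * (x.theta - y)^2
        theta = theta - step * grad
        theta_bar += (theta - theta_bar) / (i + 1)  # running average of iterates
    return theta_bar


def make_data(rng, n, beta, rho, noise=0.1):
    """Assumed toy model: Gaussian features, linear response, MCAR mask."""
    d = beta.shape[0]
    X = rng.normal(size=(n, d))
    y = X @ beta + noise * rng.normal(size=n)
    mask = rng.random((n, d)) < rho  # True where the entry is observed
    return X, y, mask


rng = np.random.default_rng(0)
n, d, rho = 2000, 50, 0.7  # rho: probability that an entry is observed (assumption)
beta = rng.normal(size=d) / np.sqrt(d)

X, y, mask = make_data(rng, n, beta, rho)
X_imp = zero_impute(X, mask)

# Heuristic constant step size based on the largest squared input norm (assumption).
step = 1.0 / (2.0 * np.max(np.sum(X_imp**2, axis=1)))
theta_hat = averaged_sgd(X_imp, y, step)

# Evaluate on fresh data that is also incomplete and zero-imputed.
X_te, y_te, mask_te = make_data(rng, 5000, beta, rho)
X_te_imp = zero_impute(X_te, mask_te)
risk_sgd = np.mean((X_te_imp @ theta_hat - y_te) ** 2)
risk_zero = np.mean(y_te**2)  # baseline: always predict 0

print("test risk, averaged SGD on zero-imputed data:", risk_sgd)
print("test risk, constant zero predictor:", risk_zero)
```

In this toy setting the imputed predictor cannot recover the signal carried by the masked entries, so its risk stays above the noise level; the sketch only illustrates that a single SGD pass on zero-imputed inputs is cheap and requires no separate imputation model.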