Missing values are prevalent across various fields, posing challenges for training and deploying predictive models. In this context, imputation is a common practice, driven by the hope that accurate imputations will enhance predictions. However, recent theoretical and empirical studies indicate that simple constant imputation can be consistent and competitive. This empirical study aims to clarify if and when investing in advanced imputation methods yields significantly better predictions. Relating imputation accuracy to prediction accuracy across combinations of imputation and predictive models on 20 datasets, we show that imputation accuracy matters less i) when using expressive models and ii) when incorporating missingness indicators as complementary inputs, and that iii) it matters much more for generated linear outcomes than for real-data outcomes. Interestingly, we also show that using the missingness indicator improves prediction performance, even in MCAR (missing completely at random) scenarios. Overall, on real data with powerful models, improving imputation has only a minor effect on prediction performance. Thus, investing in better imputations for improved predictions often offers limited benefits.
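To make the two ingredients discussed above concrete, the following is a minimal sketch of constant (mean) imputation combined with missingness indicators appended as extra input columns. It is an illustration under stated assumptions, not the pipeline used in the study: the function name `impute_with_indicators`, the use of `None` to mark missing entries, and the toy data are all hypothetical.

```python
from statistics import fmean


def impute_with_indicators(rows, fill=None):
    """Constant-impute missing entries (None) and append 0/1 missingness indicators.

    rows: list of equal-length lists holding floats or None (hypothetical format).
    fill: per-column constants; if None, the observed column means are used.
    Returns (augmented_rows, fill) so the same constants can be reused later.
    """
    n_cols = len(rows[0])
    if fill is None:
        fill = []
        for j in range(n_cols):
            observed = [r[j] for r in rows if r[j] is not None]
            # Assumption: fall back to 0.0 when a column has no observed value.
            fill.append(fmean(observed) if observed else 0.0)

    augmented = []
    for r in rows:
        # Imputed feature values, one per original column.
        values = [fill[j] if r[j] is None else r[j] for j in range(n_cols)]
        # Indicator columns: 1.0 where the original entry was missing.
        mask = [1.0 if r[j] is None else 0.0 for j in range(n_cols)]
        augmented.append(values + mask)
    return augmented, fill


# Toy data: the constants are learned on the training rows only.
X_train = [[1.0, 10.0], [None, 20.0], [3.0, None]]
X_aug, fill = impute_with_indicators(X_train)

# At test time, reuse the training constants instead of recomputing them.
X_test_aug, _ = impute_with_indicators([[None, None]], fill=fill)
```

The augmented rows can then be passed to any predictive model; because the indicator columns preserve the missingness pattern, an expressive model can compensate for a crude imputation. In scikit-learn, `SimpleImputer(add_indicator=True)` provides a comparable option.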