Missing data is a common issue in real-world datasets. This paper studies the performance of impute-then-regress pipelines by contrasting theoretical and empirical evidence. We establish the asymptotic consistency of such pipelines for a broad family of imputation methods. While common sense suggests that a "good" imputation method produces datasets that are plausible, we show, on the contrary, that, as far as prediction is concerned, crude can be good. In particular, we find that mode-impute is asymptotically sub-optimal, while mean-impute is asymptotically optimal. We then exhaustively assess the validity of these theoretical conclusions on a large corpus of synthetic, semi-real, and real datasets. While the empirical evidence we collect mostly supports our theoretical findings, it also highlights gaps between theory and practice and opportunities for future research regarding the relevance of the missing-at-random (MAR) assumption, the complex interdependency between the imputation and regression tasks, and the need for realistic synthetic data generation models.
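To make the term "impute-then-regress" concrete, here is a minimal sketch in plain Python: missing entries of a single feature are filled with either the mean or the mode of the observed values, and an ordinary least-squares line is then fitted on the completed data. The data-generating model, the 30% missingness rate, and the helper names (`impute`, `fit_line`, `run_demo`) are all assumptions made for illustration. This is not the paper's experimental setup, and a toy run like this neither confirms nor refutes its asymptotic claims.

```python
import random
import statistics


def impute(column, strategy):
    """Fill None entries with the mean or the mode of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in column]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def run_demo(seed=0, n=500, missing_rate=0.3):
    """Impute, then regress, and report the in-sample MSE per strategy."""
    rng = random.Random(seed)
    # Hypothetical toy model: an integer-valued feature and a noisy linear response.
    x_full = [rng.randint(0, 4) for _ in range(n)]
    y = [2.0 * x + rng.gauss(0.0, 1.0) for x in x_full]
    # Entries are dropped independently of everything else (an assumption).
    x_obs = [None if rng.random() < missing_rate else x for x in x_full]

    results = {}
    for strategy in ("mean", "mode"):
        x_imp = impute(x_obs, strategy)          # step 1: impute
        a, b = fit_line(x_imp, y)                # step 2: regress
        errors = [(a + b * x - t) ** 2 for x, t in zip(x_imp, y)]
        results[strategy] = statistics.fmean(errors)
    return results


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}-impute then regress: in-sample MSE = {value:.3f}")
```

The two pipelines differ only in the fill value, so any difference in error comes from the imputation step alone. A more faithful comparison would evaluate on held-out data and use a richer regression model.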