Missing values pose a persistent challenge in modern data science. Consequently, an ever-growing number of publications introduce new imputation methods in various fields. While many studies compare imputation approaches, they often focus on a limited subset of algorithms and evaluate performance primarily through pointwise metrics such as RMSE, which are not suited to measuring how well the true data distribution is preserved. In this work, we provide a systematic benchmarking method based on the idea of treating imputation as a distributional prediction task. We consider a large number of algorithms and, for the first time, evaluate them not only under synthetic missingness mechanisms but also in real-world missingness scenarios, using the concept of Imputation Scores. Finally, while previous benchmarks have often focused on numerical data, we also consider mixed data sets in our study. The analysis overwhelmingly confirms the superiority of iterative imputation algorithms, especially the methods implemented in the mice R package.
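A minimal, self-contained sketch (not part of the paper) illustrating the claim that pointwise metrics such as RMSE do not reward distributional preservation: under MCAR missingness on Gaussian data, imputing the observed mean minimizes RMSE yet visibly shrinks the variance, whereas drawing imputations from the observed empirical distribution scores worse on RMSE while preserving the variance. All variable names here are hypothetical choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(0.0, 1.0, size=n)      # fully observed "ground truth"
mask = rng.random(n) < 0.3            # 30% of entries missing completely at random

# Imputation A: observed-mean imputation (near-optimal for RMSE under MCAR)
mean_imp = np.where(mask, x[~mask].mean(), x)
# Imputation B: random draws from the observed empirical distribution
draw_imp = np.where(mask, rng.choice(x[~mask], size=n), x)

def rmse(imputed):
    """RMSE on the masked entries against the ground truth."""
    return np.sqrt(np.mean((imputed[mask] - x[mask]) ** 2))

# Mean imputation wins on RMSE but deflates the variance;
# distributional draws lose on RMSE but keep the variance near the truth.
print(f"RMSE     mean-imp: {rmse(mean_imp):.3f}   draw-imp: {rmse(draw_imp):.3f}")
print(f"Variance mean-imp: {mean_imp.var():.3f}   draw-imp: {draw_imp.var():.3f}   true: {x.var():.3f}")
```

This tension is exactly why the abstract argues for treating imputation as a distributional prediction task rather than ranking methods by RMSE alone.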