Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first develop an identification result showing that the widely used fully conditional specification (FCS) approach indeed identifies the correct conditional distributions. Based on this analysis, we propose three essential properties an ideal imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new ones. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two of the three criteria. We also discuss ways to compare imputation methods using distributional distances. Finally, numerical experiments illustrate the points made in this discussion.