Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies

From arXiv. The project "From Prediction to Agile Interventions in the Social Sciences (FAIR)" receives funding from the programme "Profilbildung 2020", an initiative of the Ministry of Culture and Science of the State of North Rhine-Westphalia. The sole responsibility for the content of this publication lies with the authors.

Missing data is a common problem in statistical analysis and is often addressed with imputation procedures. The performance and validity of these procedures matter greatly for their use in empirical studies. Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered the standard in the social science literature, but increasingly complex datasets may call for more advanced approaches based on machine learning. Tree-based imputation methods in particular have emerged as strong competitors. Their performance and validity are not yet fully understood, however, especially in comparison with standard MICE PMM and especially for inference in linear models. In this study, we investigate how various imputation methods affect coefficient estimation, Type I error, and power, to gain insights that can help empirical researchers deal with missingness more effectively. We compare MICE PMM with several tree-based methods: MICE with Random Forest (RF), Chained Random Forests with and without PMM (missRanger), and Extreme Gradient Boosting (MIXGBoost). The comparison is based on a realistic simulation study that uses the German National Educational Panel Study (NEPS) as the original data source. Our results show that Random Forest-based imputations, especially MICE RF and missRanger with PMM, perform better in most scenarios. Standard MICE PMM shows increased bias in some settings and overly conservative test decisions, particularly for coefficients whose true value is not zero. Our results thus underscore the potential advantages of tree-based imputation methods, with the caveat that all methods perform worse as the amount of missingness increases, missRanger in particular.
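To make the matching step behind PMM concrete, the following is a minimal single-variable sketch in Python. It is not the implementation used in the study or in the `mice` or `missRanger` packages. It assumes a plain least-squares prediction model and omits the parameter draw that proper multiple imputation adds before matching. The function name `pmm_impute` and the donor-pool size `k` are illustrative assumptions.

```python
import numpy as np


def pmm_impute(X, y, k=5, rng=None):
    """Fill NaN entries of y by simplified predictive mean matching.

    Assumptions (not from the paper): a linear prediction model fitted by
    least squares, k nearest donors, and one random donor per missing case.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    miss = np.isnan(y)
    obs = ~miss

    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(len(y)), X])

    # Fit the prediction model on the observed cases only.
    beta, *_ = np.linalg.lstsq(A[obs], y[obs], rcond=None)
    pred = A @ beta

    donor_pred = pred[obs]   # predicted means of the observed cases
    donor_y = y[obs]         # their actually observed values
    k = min(k, donor_y.size)

    y_imp = y.copy()
    for i in np.flatnonzero(miss):
        # Donors are the k observed cases whose predicted mean is closest.
        nearest = np.argsort(np.abs(donor_pred - pred[i]))[:k]
        # Impute the observed value of one randomly chosen donor.
        y_imp[i] = donor_y[rng.choice(nearest)]
    return y_imp


if __name__ == "__main__":
    gen = np.random.default_rng(0)
    X = gen.normal(size=(200, 2))
    y = 1.0 + X @ np.array([2.0, -1.0]) + gen.normal(scale=0.5, size=200)
    y[gen.random(200) < 0.3] = np.nan  # roughly 30% missing completely at random
    completed = pmm_impute(X, y, k=5, rng=1)
    print("remaining NaNs:", int(np.isnan(completed).sum()))
```

Because every imputed value is copied from an observed case, the completed variable stays within the observed range. A tree-based variant in the spirit of the methods compared above would replace the least-squares predictions with predictions from a random forest, keeping the matching step unchanged.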
