Missing data are ubiquitous in empirical databases, yet most statistical analyses require complete data matrices. Multiple imputation offers a principled way to fill these gaps. This study evaluates the performance of several multiple imputation methods, both in the presence and in the absence of extreme values, using the MICE package in R. Through Monte Carlo simulations, we generated incomplete data sets with three variables and assessed each imputation method within regression models. The results indicate that the linear-regression-based imputation method showed the best overall predictive performance, measured by cross-validated mean squared error (CV-MSE), whereas the sparse model approach was generally less efficient. Our findings underscore the relevance of extreme values when selecting an imputation strategy and highlight sample size, proportion of missingness, presence of extremes, and the type of fitted model as key determinants of performance. Despite its limitations, the study offers practical recommendations for researchers, stressing the need to examine the missingness mechanism and the occurrence of extreme values before choosing an imputation method.
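The simulation design summarized above can be illustrated with a minimal sketch. It is written in standard-library Python, not with the MICE package the study used, and it is not the authors' code. The data-generating coefficients, the MCAR missingness mechanism, the way extreme values are injected, and all function names (`ols`, `simulate`, `impute_once`, `cv_mse`, `run`) are assumptions chosen for illustration. The imputation step is a simplified stochastic regression imputation that ignores parameter uncertainty.

```python
import random

def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def simulate(n, p_missing, p_extreme, rng):
    """Three variables (x1, x2, y); x2 is set to None completely at random."""
    data = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = 0.5 * x1 + rng.gauss(0, 1)
        y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.gauss(0, 1)  # assumed model
        if rng.random() < p_extreme:
            y += rng.choice([-1, 1]) * 10.0  # assumed crude extreme value
        data.append([x1, x2, y])
    for row in data:
        if rng.random() < p_missing:  # assumed MCAR mechanism
            row[1] = None
    return data

def impute_once(data, rng):
    """Regress x2 on (x1, y) using observed rows, then fill gaps with noise."""
    obs = [r for r in data if r[1] is not None]
    X = [[r[0], r[2]] for r in obs]
    t = [r[1] for r in obs]
    beta = ols(X, t)
    resid = [ti - predict(beta, xi) for xi, ti in zip(X, t)]
    sigma = (sum(e * e for e in resid) / (len(obs) - 3)) ** 0.5
    return [r if r[1] is not None
            else [r[0], predict(beta, [r[0], r[2]]) + rng.gauss(0, sigma), r[2]]
            for r in data]

def cv_mse(data, folds=5):
    """k-fold cross-validated MSE of the analysis model y ~ x1 + x2."""
    errs = []
    for f in range(folds):
        train = [r for i, r in enumerate(data) if i % folds != f]
        test = [r for i, r in enumerate(data) if i % folds == f]
        beta = ols([[r[0], r[1]] for r in train], [r[2] for r in train])
        errs += [(r[2] - predict(beta, [r[0], r[1]])) ** 2 for r in test]
    return sum(errs) / len(errs)

def run(n=200, p_missing=0.2, p_extreme=0.0, m=5, seed=1):
    """One Monte Carlo replicate: average CV-MSE over m imputed data sets."""
    rng = random.Random(seed)
    data = simulate(n, p_missing, p_extreme, rng)
    return sum(cv_mse(impute_once(data, rng)) for _ in range(m)) / m

if __name__ == "__main__":
    print("no extremes:  ", run(p_extreme=0.0))
    print("with extremes:", run(p_extreme=0.1))
```

A full study would repeat `run` over many seeds and over a grid of sample sizes, missingness proportions, and imputation methods, then compare the averaged CV-MSE values across conditions.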