This simulation study evaluates the effectiveness of multiple imputation (MI) techniques for multilevel data. It compares traditional Multiple Imputation by Chained Equations (MICE) with two tree-based methods: Chained Random Forests with Predictive Mean Matching and Extreme Gradient Boosting. For the tree-based methods, adapted versions that add dummy variables for cluster membership are also evaluated. The methods are assessed on coefficient estimation bias, statistical power, and type I error rates, using simulated hierarchical data with different cluster sizes (25 and 50) and levels of missingness (10\% and 50\%). Coefficients are estimated using random intercept and random slope models. The results show that MICE is preferred for accurate rejection rates, whereas Extreme Gradient Boosting is advantageous for reducing bias. Bias levels are similar across cluster sizes, but rejection rates tend to be less favorable with fewer clusters (lower power, higher type I error). Including cluster dummies in the tree-based methods improves estimation for Level 1 variables but is less effective for Level 2 variables. When the data become too complex and MICE is too slow, Extreme Gradient Boosting is a good alternative for hierarchical data.
Keywords: Multiple imputation; multi-level data; MICE; missRanger; mixgb
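To make the cluster-dummy adaptation concrete, the following is a minimal sketch of how two-level data with a random intercept might be simulated, partially deleted, and augmented with cluster-membership dummies before being passed to an imputation method. It is not the study's actual design: the function names, the coefficients, the number of units per cluster, the missing-completely-at-random mechanism, and the reading of 25 as the number of clusters are all assumptions made for illustration.

```python
import numpy as np

def simulate_two_level(n_clusters=25, n_per_cluster=20, miss_rate=0.10, seed=0):
    """Simulate random-intercept data and delete Level 1 values at random.

    All numeric settings here are illustrative assumptions, not the study's.
    """
    rng = np.random.default_rng(seed)
    cluster = np.repeat(np.arange(n_clusters), n_per_cluster)
    n = cluster.size
    x1 = rng.normal(size=n)                                # Level 1 predictor
    x2 = rng.normal(size=n_clusters)[cluster]              # Level 2 predictor (constant within cluster)
    u0 = rng.normal(scale=0.5, size=n_clusters)[cluster]   # random intercept per cluster
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + u0 + rng.normal(size=n)
    x1_obs = x1.copy()
    x1_obs[rng.random(n) < miss_rate] = np.nan             # assumed MCAR missingness in x1
    return cluster, x1_obs, x2, y

def cluster_dummies(cluster):
    """One-hot encode cluster membership (one column per cluster)."""
    labels = np.unique(cluster)
    return (cluster[:, None] == labels[None, :]).astype(float)

cluster, x1_obs, x2, y = simulate_two_level()
D = cluster_dummies(cluster)
# Matrix an imputation routine would receive: outcome, predictors, cluster dummies.
X_imp = np.column_stack([y, x1_obs, x2, D])
```

The dummy columns give a method that is otherwise unaware of the hierarchy a way to split on cluster membership. Note that the Level 2 predictor is constant within each cluster, so it carries no within-cluster variation beyond what the dummies already encode.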