Multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when applied to large datasets with complex data structures. Their imputation performance usually relies on the proper specification of imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this paper, we propose mixgb, a scalable MI framework based on XGBoost, subsampling, and predictive mean matching. Our approach leverages XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and non-linear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to reflect imputation variability more appropriately. The proposed framework is implemented in the R package mixgb. Supplementary materials for this article are available online.
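To make the combination of subsampling and predictive mean matching concrete, the following is a minimal sketch in Python. It is not the mixgb implementation or its API. It assumes a single numeric predictor and, to stay dependency-free, substitutes a one-variable least-squares fit for XGBoost. The function names `fit_line` and `pmm_impute` and the parameters `m`, `subsample`, `k`, and `seed` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x.

    Assumption: this simple model stands in for XGBoost so the sketch
    needs no third-party packages.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def pmm_impute(x, y, m=5, subsample=0.7, k=3, seed=0):
    """Return m completed copies of y, where None marks a missing value.

    Hypothetical illustration: for each imputation, fit the model on a
    random subsample of the observed rows, then fill each missing entry
    with the observed value of a donor whose prediction is close to the
    prediction for the missing row (predictive mean matching).
    """
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    if len(obs) < 2:
        raise ValueError("need at least two observed values")

    completed = []
    for _ in range(m):
        # Subsampling: a different random subset per imputation adds
        # between-imputation variability.
        n_sub = min(len(obs), max(2, round(subsample * len(obs))))
        sub = rng.sample(obs, n_sub)
        predict = fit_line([x[i] for i in sub], [y[i] for i in sub])

        # Pair each donor's prediction with its observed value.
        donors = [(predict(x[i]), y[i]) for i in obs]

        filled = list(y)
        for i in mis:
            target = predict(x[i])
            # Keep the k donors whose predictions are nearest the target,
            # then draw one of them at random.
            nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
            filled[i] = rng.choice(nearest)[1]
        completed.append(filled)
    return completed


# Usage on made-up data: two of ten responses are missing.
x_demo = [float(i) for i in range(10)]
y_demo = [2.0 * v + 1.0 for v in x_demo]
y_demo[3] = None
y_demo[7] = None
imputations = pmm_impute(x_demo, y_demo, m=3)
```

Because every imputed value is copied from an observed donor, the completed data never contain values outside the observed range. The random donor draw and the per-imputation subsample are what make the `m` completed datasets differ from one another.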