Statistical learning methods for automated variable selection, such as the Least Absolute Shrinkage and Selection Operator (LASSO), elastic nets, and gradient boosting, have become increasingly popular tools for building powerful prediction models. In practice, however, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which involves creating several completed datasets. How to perform model selection across multiply imputed datasets nevertheless remains a matter of ongoing debate. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches extend the regularization methods LASSO and elastic nets to multiply imputed datasets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a uniform variable-selection mechanism across imputed datasets, together with its corresponding cross-validation routine MIBoostCV. In a simulation study, MIBoost yielded predictive performance comparable to that of other established methods, providing a practical boosting-based approach for variable selection with multiply imputed data. The proposed framework is implemented as the R package booami.
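The idea of a uniform variable-selection mechanism across imputed datasets can be illustrated with a minimal sketch. The code below is not the MIBoost algorithm and does not use the booami API; it is a simplified illustration under stated assumptions: squared-error loss, univariate linear base learners, centered data, and a selection rule that picks, at each step, the single covariate with the largest total reduction in residual sum of squares summed over all imputations. The function name `pooled_componentwise_boost` and the parameters `n_steps` and `nu` are hypothetical.

```python
import numpy as np


def pooled_componentwise_boost(X_list, y_list, n_steps=100, nu=0.1):
    """Component-wise L2 boosting with one shared selection per step.

    Illustrative sketch only (not the booami implementation).

    X_list: list of (n, p) arrays, one per imputed dataset, columns centered
            and assumed to be non-constant (non-zero sum of squares).
    y_list: list of (n,) centered outcome arrays, one per imputed dataset.
    Returns an (M, p) array holding one coefficient vector per imputation.
    """
    M = len(X_list)
    p = X_list[0].shape[1]
    beta = np.zeros((M, p))
    # Residuals start at the (centered) outcome; np.array makes a copy.
    resid = [np.array(y, dtype=float) for y in y_list]
    # Per-imputation column sums of squares, shape (M, p).
    ss = np.array([np.einsum("ij,ij->j", X, X) for X in X_list])

    for _ in range(n_steps):
        # Least-squares slope of the current residual on each single column.
        slopes = np.array(
            [X.T @ r / s for X, r, s in zip(X_list, resid, ss)]
        )  # shape (M, p)
        # RSS reduction from fitting column j in imputation m: slope^2 * ss.
        gain = slopes**2 * ss
        # Shared selection: the covariate with the largest summed reduction.
        j = int(np.argmax(gain.sum(axis=0)))
        # Update that same covariate in every imputed dataset.
        for m in range(M):
            step = nu * slopes[m, j]
            beta[m, j] += step
            resid[m] -= step * X_list[m][:, j]
    return beta
```

Because the same covariate is updated in every imputed dataset at each step, the set of selected variables is identical across imputations, and the per-imputation coefficients can then be combined, for example by averaging with `beta.mean(axis=0)`. The averaging step is likewise an assumption of this sketch rather than a description of the package.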