Statistical learning methods for automated variable selection, such as LASSO, elastic nets, or gradient boosting, have become increasingly popular tools for building powerful prediction models. Yet, in practice, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which involves creating several completed datasets. However, there is an ongoing debate on how to perform model selection in the presence of multiply imputed datasets. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches modify the regularization methods LASSO and elastic nets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a uniform variable-selection mechanism across imputed datasets. Simulation studies suggest that our approach yields prediction performance comparable to that of these recently proposed methods.
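To make the core idea concrete, the following is a minimal, hypothetical sketch of component-wise L2-boosting with a single variable-selection step shared across imputed datasets: at each iteration, the candidate variable is chosen by summing the squared-error loss over all imputations, so the same variable is updated everywhere. The function name `miboost`, the averaging of per-imputation base-learner coefficients, and all parameter choices are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np

def miboost(imputed_Xs, imputed_ys, n_iter=100, nu=0.1):
    """Illustrative sketch (not the published algorithm): component-wise
    L2-boosting where variable selection is unified across M imputations.

    imputed_Xs : list of (n, p) arrays, one per imputed dataset
    imputed_ys : list of (n,) outcome arrays, one per imputed dataset
    nu         : learning rate (step-length factor)
    """
    M = len(imputed_Xs)
    p = imputed_Xs[0].shape[1]
    beta = np.zeros(p)                       # one unified coefficient vector
    residuals = [y.astype(float).copy() for y in imputed_ys]

    for _ in range(n_iter):
        best_j, best_loss, best_coef = 0, np.inf, 0.0
        # Select ONE variable by its summed loss over all imputations.
        for j in range(p):
            coefs, loss = [], 0.0
            for m in range(M):
                x = imputed_Xs[m][:, j]
                b = (x @ residuals[m]) / (x @ x)   # simple least-squares base learner
                coefs.append(b)
                loss += np.sum((residuals[m] - b * x) ** 2)
            if loss < best_loss:
                best_j, best_loss = j, loss
                best_coef = np.mean(coefs)   # assumption: pool by averaging
        # Shared update: same variable, same shrunken coefficient everywhere.
        beta[best_j] += nu * best_coef
        for m in range(M):
            residuals[m] -= nu * best_coef * imputed_Xs[m][:, best_j]
    return beta
```

Because selection is driven by the pooled loss, variables that predict well in only one imputed dataset are down-weighted, which is the intuition behind a uniform variable-selection mechanism across imputations. In practice one would also standardize the predictors and tune `n_iter` by cross-validation.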