The two primary approaches to high-dimensional regression are sparse methods (e.g., best subset selection, which penalizes the L0-norm of the coefficient vector) and ensemble methods (e.g., random forests). Although sparse methods typically yield interpretable models, they are often outperformed in prediction accuracy by "black-box" multi-model ensemble methods. We propose an algorithm that optimizes an ensemble of L0-penalized regression models by extending recent developments in L0-optimization for sparse methods to multi-model regression ensembles. The sparse and diverse models in the ensemble are learned simultaneously from the data, and each model provides an explanation for the relationship between a subset of the predictors and the response variable. We show how the ensembles achieve excellent prediction accuracy by exploiting the accuracy-diversity tradeoff of ensembles, and we investigate the effect of the number of models. In prediction tasks, the ensembles can outperform state-of-the-art competitors on both simulated and real data. We also generalize forward stepwise regression to multi-model regression ensembles and use it to obtain an initial solution for our algorithm. The optimization algorithms are implemented in publicly available software packages.
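As a point of reference for the forward stepwise initialization mentioned in the abstract, the following is a minimal sketch of ordinary single-model forward stepwise regression in plain Python. It is not the multi-model generalization the abstract proposes, and the function name `forward_stepwise`, its parameters, and the toy data are assumptions made for illustration only.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def center(v):
    """Subtract the mean so that an intercept is handled implicitly."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def forward_stepwise(X, y, max_terms, tol=1e-12):
    """Greedy single-model forward selection (illustrative sketch).

    X is a list of rows, y a list of responses. At each step the predictor
    giving the largest drop in residual sum of squares is added. Returns the
    selected column indices in order of entry.
    """
    p = len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    resid = center(y)
    selected = []
    for _ in range(max_terms):
        best_j, best_gain = None, tol
        for j in range(p):
            if j in selected:
                continue
            norm_sq = dot(cols[j], cols[j])
            if norm_sq <= tol:
                continue  # column is (numerically) spanned by selected ones
            # RSS reduction from adding column j to the current model
            gain = dot(cols[j], resid) ** 2 / norm_sq
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break  # no remaining predictor improves the fit
        selected.append(best_j)
        q = cols[best_j]
        qq = dot(q, q)
        # Remove the new direction from the residual ...
        coef = dot(q, resid) / qq
        resid = [r - coef * qi for r, qi in zip(resid, q)]
        # ... and from every predictor not yet selected (Gram-Schmidt step)
        for j in range(p):
            if j not in selected:
                c = dot(q, cols[j]) / qq
                cols[j] = [v - c * qi for v, qi in zip(cols[j], q)]
    return selected


# Toy data (hypothetical): four orthogonal predictors, y = 3*x0 - 2*x2.
x0 = [1, -1, 1, -1, 1, -1, 1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, 1, 1, 1, -1, -1, -1, -1]
x3 = [1, -1, -1, 1, 1, -1, -1, 1]
X = [list(row) for row in zip(x0, x1, x2, x3)]
y = [3 * a - 2 * c for a, c in zip(x0, x2)]

print(forward_stepwise(X, y, 3))  # prints [0, 2]
```

The search stops after two terms here because the two selected predictors already fit the toy response exactly. An ensemble version would additionally have to decide which model each candidate predictor enters, which this sketch does not attempt.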