Ensemble learning is a popular technique for improving the accuracy of machine learning models. It hinges on the rationale that aggregating multiple weak models can yield better models with lower variance and hence higher stability, especially for discontinuous base learners. In this paper, we provide a new perspective on ensembling. By selecting the best model trained on subsamples via majority voting, we can attain exponentially decaying tails for the excess risk, even if the base learner suffers from slow (i.e., polynomial) decay rates. This tail-enhancement power of ensembling is agnostic to the underlying base learner and is stronger than variance reduction in the sense of exhibiting rate improvement. We demonstrate how our ensemble methods can substantially improve out-of-sample performance in a range of examples involving heavy-tailed data or intrinsically slow rates. Code for the proposed methods is available at https://github.com/mickeyhqian/VoteEnsemble.
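The selection procedure described above — train candidates on subsamples, then pick one by majority voting — can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm (see the linked repository for the authors' implementation); the function names, the subsample sizes, and the use of empirical loss on fresh subsamples as the voting criterion are all assumptions made for illustration.

```python
import random
from collections import Counter

def vote_ensemble(data, train, loss, n_models=10, n_votes=50,
                  subsample=0.5, seed=0):
    """Illustrative majority-vote model selection (hypothetical sketch).

    1) Train candidate models on random subsamples of the data.
    2) On each of n_votes fresh random subsamples, cast a vote for the
       candidate with the lowest empirical loss on that subsample.
    3) Return the candidate that collects the most votes.
    """
    rng = random.Random(seed)
    k = max(1, int(len(data) * subsample))
    candidates = [train(rng.sample(data, k)) for _ in range(n_models)]
    votes = Counter()
    for _ in range(n_votes):
        batch = rng.sample(data, k)
        best = min(range(n_models), key=lambda i: loss(candidates[i], batch))
        votes[best] += 1
    return candidates[votes.most_common(1)[0][0]]

# Toy usage: base learner is the sample mean, data has a few large outliers
# standing in for heavy tails.
rng = random.Random(42)
data = [rng.gauss(0, 1) for _ in range(500)] + [100.0] * 5
mean_train = lambda sample: sum(sample) / len(sample)
sq_loss = lambda m, sample: sum((x - m) ** 2 for x in sample) / len(sample)
est = vote_ensemble(data, mean_train, sq_loss)
```

Here each voting round is a cheap evaluation rather than a retraining pass, so the candidate that wins is the one that performs best across many resamples rather than on a single lucky split — the mechanism behind the tail-decay improvement the abstract describes.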