We present a theory of ensemble diversity that explains the nature of diversity across a wide range of supervised learning scenarios. Understanding ensemble diversity has been called the "holy grail" of ensemble learning and has remained an open research question for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of the ensemble loss. We prove a family of exact bias-variance-diversity decompositions for both regression and classification, covering losses such as the squared, cross-entropy, and Poisson losses. For losses where an additive bias-variance decomposition is not available (e.g., 0/1 loss), we present an alternative approach that precisely quantifies the effects of diversity, which turn out to depend on the label distribution. Experiments show how our framework can be used to understand the diversity-encouraging mechanisms of popular methods: Bagging, Boosting, and Random Forests.
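As a rough illustration of what a diversity term looks like, the sketch below numerically checks the classical ambiguity identity for squared loss with a uniformly averaged ensemble: the ensemble loss equals the average member loss minus the average squared spread of the members around the ensemble prediction. This is only the well-known squared-loss special case, not the general decompositions proved in the paper; the function name `ambiguity_decomposition` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def ambiguity_decomposition(preds, y):
    """Squared-loss ambiguity identity for an averaging ensemble.

    preds: array of shape (M, N), predictions of M members on N points.
    y:     array of shape (N,), regression targets.
    Returns (ensemble_loss, avg_member_loss, diversity).
    Hypothetical helper for illustration only.
    """
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)

    # Ensemble prediction: uniform average over members.
    ens = preds.mean(axis=0)

    # Mean squared error of the ensemble.
    ensemble_loss = np.mean((ens - y) ** 2)
    # Mean squared error averaged over members and points.
    avg_member_loss = np.mean((preds - y) ** 2)
    # Diversity: mean squared spread of members around the ensemble.
    diversity = np.mean((preds - ens) ** 2)

    return ensemble_loss, avg_member_loss, diversity


# Synthetic data (an assumption): 10 noisy members, 200 points.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
preds = y + rng.normal(scale=0.5, size=(10, 200))

ensemble_loss, avg_member_loss, diversity = ambiguity_decomposition(preds, y)

# The identity holds exactly, up to floating-point error.
assert np.isclose(ensemble_loss, avg_member_loss - diversity)
```

Because the diversity term is non-negative and enters with a minus sign, the averaged ensemble is never worse than the average member under squared loss; the abstract's claim is that analogous exact terms exist for other losses as well.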