Ensembling has a long history in statistical data analysis, with many impactful applications. In many modern machine learning settings, however, its benefits are less ubiquitous and less obvious. We study, both theoretically and empirically, the fundamental question of when ensembling yields significant performance improvements in classification tasks. Theoretically, we prove new results relating the \emph{ensemble improvement rate} (a measure of how much ensembling decreases the error rate relative to that of a single model) to the \emph{disagreement-error ratio}. We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate, and that, conversely, one classifier is often enough whenever the disagreement rate is low relative to the average error rate. On the way to proving these results, we derive, under a mild condition called \emph{competence}, improved upper and lower bounds on the average test error rate of the majority vote classifier. To complement this theory, we study ensembling empirically in a variety of settings, verifying the predictions made by our theory and identifying practical scenarios where ensembling does and does not result in large performance improvements. Perhaps most notably, we find a distinct difference in behavior between interpolating models (popular in current practice) and non-interpolating models (such as tree-based methods, where ensembling is popular): ensembling helps considerably more in the latter case than in the former.
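As a rough illustration of the quantities named above, the following sketch estimates them from a matrix of predicted labels. The abstract does not state the formulas, so the definitions here are assumptions: the ensemble improvement rate is taken to be (average error - majority-vote error) / average error, and the disagreement-error ratio is taken to be the mean pairwise disagreement rate divided by the average error rate. The function name `ensemble_stats` and its arguments are hypothetical.

```python
import numpy as np


def ensemble_stats(preds, y):
    """Estimate ensemble statistics from predicted labels.

    preds: integer labels, shape (n_models, n_samples), with n_models >= 2
    y:     true integer labels, shape (n_samples,)
    The formulas below are assumed definitions, not taken from the abstract.
    """
    preds = np.asarray(preds)
    y = np.asarray(y)
    n_models = preds.shape[0]

    # Average error rate over all individual classifiers.
    avg_err = float(np.mean(preds != y))

    # Majority vote: count votes per class for each sample.
    # argmax breaks ties in favor of the smallest label.
    n_classes = int(max(preds.max(), y.max())) + 1
    votes = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    mv_err = float(np.mean(votes.argmax(axis=0) != y))

    # Disagreement rate, averaged over all distinct pairs of classifiers.
    pair_rates = [
        np.mean(preds[i] != preds[j])
        for i in range(n_models)
        for j in range(i + 1, n_models)
    ]
    disagreement = float(np.mean(pair_rates))

    # Both ratios are undefined when the average error is zero.
    eir = (avg_err - mv_err) / avg_err if avg_err > 0 else float("nan")
    der = disagreement / avg_err if avg_err > 0 else float("nan")
    return {"avg_err": avg_err, "mv_err": mv_err,
            "disagreement": disagreement, "eir": eir, "der": der}


if __name__ == "__main__":
    y = [0, 0, 0, 0, 0, 0]
    # Three classifiers that err on disjoint samples: high disagreement.
    diverse = [[1, 1, 0, 0, 0, 0],
               [0, 0, 1, 1, 0, 0],
               [0, 0, 0, 0, 1, 1]]
    # Three identical classifiers: no disagreement.
    identical = [[1, 0, 0, 0, 0, 0]] * 3
    print(ensemble_stats(diverse, y))
    print(ensemble_stats(identical, y))
```

In this toy example the diverse ensemble has a disagreement rate twice its average error rate and the majority vote removes every error, while the identical classifiers never disagree and the majority vote gains nothing. This matches the qualitative claim above, though it is only a constructed example and not evidence for the theory.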