We propose a fundamental theory of ensemble learning that answers the central question: what factors make an ensemble system good or bad? Previous studies used a variant of Fano's inequality from information theory to derive a lower bound on the classification error rate in terms of the $\textit{accuracy}$ and $\textit{diversity}$ of the models. We revisit the original Fano's inequality and argue that these studies did not account for the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate this information loss, which we name $\textit{combination loss}$. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each of these metrics, which will advance the theoretical understanding of ensemble learning and offer insights into designing systems.
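
For orientation, the argument can be sketched with the standard form of Fano's inequality. The notation here is an assumption and is not taken from the abstract: $Y$ is the true label over a finite label set $\mathcal{Y}$, $O = (O_1, \dots, O_M)$ are the individual model predictions, $\hat{Y} = f(O)$ is the combined prediction, and $P_e = \Pr(\hat{Y} \neq Y)$. The exact definitions used in the paper may differ. With entropies in bits, Fano's inequality and its usual weakened form read:

$$H(Y \mid \hat{Y}) \le H_b(P_e) + P_e \log_2\left(|\mathcal{Y}| - 1\right) \quad\Longrightarrow\quad P_e \ge \frac{H(Y) - I(Y; \hat{Y}) - 1}{\log_2 |\mathcal{Y}|}$$

Because $\hat{Y}$ is a function of $O$, the data processing inequality gives $I(Y; \hat{Y}) \le I(Y; O)$. The nonnegative gap $I(Y; O) - I(Y; \hat{Y})$ is one plausible reading of what the abstract calls combination loss, and separating it out gives:

$$P_e \ge \frac{H(Y) - I(Y; O) + \left[\, I(Y; O) - I(Y; \hat{Y}) \,\right] - 1}{\log_2 |\mathcal{Y}|}$$

A bound written with $I(Y; O)$ alone drops the bracketed term. It remains a valid lower bound, but a looser one, since it ignores whatever information the combination function $f$ discards.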