Ensemble learning combines several individual models to obtain better generalization performance. In this work we present a practical method for estimating the joint power of several classifiers. Unlike existing approaches, which focus on "diversity" measures, it does not rely on labels. This makes it both accurate and practical in the modern setting of unsupervised learning with huge datasets. The heart of the method is a combinatorial bound on the number of mistakes the ensemble is likely to make. The bound can be efficiently approximated in time linear in the number of samples. We relate the bound to actual misclassifications, which establishes its usefulness as a predictor of performance. We demonstrate the method on popular large-scale face recognition datasets, which provide a useful playground for fine-grained classification tasks using noisy data over many classes.
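The abstract does not state the combinatorial bound itself, so the following is not the authors' method. It is a minimal sketch, under stated assumptions, of how a label-free statistic can constrain mistake counts in one linear pass over the samples. It relies only on an elementary counting fact: each sample has a single true label, so on any sample at most the largest group of classifiers that agree with one another can be correct, and the remaining classifiers must be wrong. The function name `disagreement_stats`, its input layout, and the returned keys are hypothetical and chosen for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def disagreement_stats(predictions: Sequence[Sequence[Hashable]]) -> Dict[str, int]:
    """Label-free counts derived from classifier disagreement.

    predictions[i][j] is the label predicted by classifier i for sample j
    (an assumed layout, not taken from the paper).
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    k = len(predictions)
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all classifiers must predict the same samples")

    non_unanimous = 0        # samples where the classifiers do not all agree
    min_total_mistakes = 0   # lower bound on the summed individual mistakes

    # One pass over the samples: O(n * k), i.e. linear in n for a fixed ensemble.
    for j in range(n):
        votes = Counter(p[j] for p in predictions)
        largest_group = max(votes.values())
        # At most `largest_group` classifiers can match the single true label,
        # so at least k - largest_group of them are wrong on this sample.
        min_total_mistakes += k - largest_group
        if largest_group < k:
            non_unanimous += 1

    return {
        "non_unanimous_samples": non_unanimous,
        "min_total_individual_mistakes": min_total_mistakes,
    }


if __name__ == "__main__":
    # Three hypothetical classifiers, four samples, no ground-truth labels.
    preds = [
        [0, 1, 2, 2],
        [0, 1, 1, 2],
        [0, 2, 0, 2],
    ]
    print(disagreement_stats(preds))
```

Note that this yields a lower bound on the classifiers' combined individual mistakes, not an upper bound on the ensemble's mistakes: unanimous predictions can still be wrong, which is why relating such a quantity to actual misclassifications requires the kind of analysis the abstract describes.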