Ensemble classifiers are predictive models that combine the results of simpler base models, often by majority vote. A classic example is the random forest, which combines the predictions of decision trees. Ensembles that use more base models can be more accurate, but they are also more costly to train and run. In this paper, we consider strategies for reducing the computational cost of binary classification using an approach from the field of sequential testing. Rather than evaluating all the base models and taking a majority vote, we evaluate the base models sequentially and stop execution when a clear majority emerges. We consider three different notions of optimality for early-stopping strategies that minimize the number of base models executed while controlling the rate of disagreement with the full ensemble. For each notion of optimality and each allowable disagreement rate, we show that a linear program can be constructed and solved efficiently to find the optimal stopping strategy. We tested these methods on real-world datasets taken from the UC Irvine Machine Learning Repository and on the benchmark datasets proposed by Grinsztajn et al. We found that on most datasets, these methods provide speed-ups of 4x or more while controlling the disagreement rate at 0.1%.
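The idea of evaluating base models sequentially and stopping once the majority is settled can be illustrated with a minimal sketch. This is not the linear-programming strategy described in the abstract. It is the simplest possible stopping rule, which halts only when the remaining votes can no longer change the outcome, so it never disagrees with the full ensemble. The function name `early_stopping_vote`, the representation of base models as callables returning 0 or 1, and the convention that ties resolve to class 0 are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple


def early_stopping_vote(
    base_models: Sequence[Callable[[object], int]], x: object
) -> Tuple[int, int]:
    """Return (predicted_label, number_of_models_evaluated).

    Hypothetical helper: each base model is assumed to be a callable
    that maps an input to a binary label, 0 or 1.
    """
    n = len(base_models)
    needed = n // 2 + 1  # votes required for a strict majority
    ones = 0
    zeros = 0
    for evaluated, model in enumerate(base_models, start=1):
        if model(x) == 1:
            ones += 1
        else:
            zeros += 1
        # Stop as soon as one class holds a strict majority of all n votes:
        # the models not yet evaluated cannot overturn it.
        if ones >= needed:
            return 1, evaluated
        if zeros >= needed:
            return 0, evaluated
    # Reached only when no strict majority exists (a tie with an even
    # number of models, or an empty ensemble). Assumed tie-break: class 0.
    return (1 if ones > zeros else 0), n


# Toy ensemble of five constant "models": three vote 1, then two vote 0.
toy_models = [lambda x: 1, lambda x: 1, lambda x: 1, lambda x: 0, lambda x: 0]
label, used = early_stopping_vote(toy_models, x=None)
print(label, used)
```

In the toy ensemble the first three models already form a strict majority of five, so the last two are never run. The strategies in the abstract go further than this rule: they accept a small, controlled disagreement rate with the full ensemble in exchange for stopping earlier.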