Determining the optimal sample complexity of PAC learning in the realizable setting was a central open problem in learning theory for decades. The seminal work of Hanneke (2016) finally resolved it with an algorithm whose sample complexity is provably optimal. His algorithm carefully constructs structured sub-samples of the training data and returns a majority vote among the hypotheses trained on each sub-sample. Although this is a very exciting theoretical result, it has had little impact in practice, in part because it is inefficient: it constructs a polynomial number of sub-samples of the training data, each of linear size. In this work, we prove the surprising result that the practical and classic heuristic bagging (a.k.a. bootstrap aggregation), due to Breiman (1996), is in fact also an optimal PAC learner. Bagging predates Hanneke's algorithm by twenty years and is taught in most undergraduate machine learning courses. Moreover, we show that it requires only a logarithmic number of sub-samples to reach optimality.
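For readers who want a concrete picture of the procedure the abstract refers to, the following is a minimal sketch of classic bagging with a majority vote, not the construction analyzed in this work. Several things here are assumptions made for illustration: the names `train_threshold`, `log_num_bags`, and `bagging` are hypothetical, the base learner is a toy one-dimensional threshold rule, each bootstrap sample has the same size as the training set, and the constant in the logarithmic number of sub-samples is arbitrary.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]            # (feature, label in {0, 1})
Hypothesis = Callable[[float], int]


def train_threshold(sample: Sequence[Example]) -> Hypothesis:
    """Hypothetical base learner (an assumption, not from the paper):
    a 1-D threshold rule consistent with realizable threshold-labelled data."""
    positives = [x for x, y in sample if y == 1]
    # Predict 1 at or above the smallest positive point seen in the sample.
    t = min(positives) if positives else math.inf
    return lambda x: 1 if x >= t else 0


def log_num_bags(n: int) -> int:
    """Illustrative logarithmic number of sub-samples; the constant is an
    assumption, and the count is forced odd so binary votes cannot tie."""
    return 2 * math.ceil(math.log(max(n, 2))) + 1


def bagging(
    data: Sequence[Example],
    learner: Callable[[Sequence[Example]], Hypothesis],
    num_bags: int,
    rng: random.Random,
) -> Hypothesis:
    """Train `learner` on `num_bags` bootstrap samples and return the majority vote."""
    n = len(data)
    hypotheses: List[Hypothesis] = []
    for _ in range(num_bags):
        # Bootstrap sample: n draws from the training data, with replacement.
        bootstrap = [data[rng.randrange(n)] for _ in range(n)]
        hypotheses.append(learner(bootstrap))

    def majority_vote(x: float) -> int:
        ones = sum(h(x) for h in hypotheses)
        return 1 if 2 * ones > len(hypotheses) else 0

    return majority_vote


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy realizable data: the true concept is the threshold x >= 0.5.
    data = [(i / 200, 1 if i >= 100 else 0) for i in range(200)]
    classifier = bagging(data, train_threshold, log_num_bags(len(data)), rng)
    print(classifier(0.1), classifier(0.9))
```

The sketch shows only the shape of the method: the number of trained hypotheses grows logarithmically with the training set size, in contrast to the polynomial number of sub-samples mentioned above for Hanneke's algorithm.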