Understanding and certifying the generalization performance of machine learning algorithms -- i.e. obtaining theoretical estimates of the test error from a finite training sample -- is a central theme of statistical learning theory. Among the many complexity measures used to derive such guarantees, Rademacher complexity yields sharp, data-dependent bounds that apply well beyond classical $0$--$1$ classification. In this study, we formalize the generalization error bound by Rademacher complexity in Lean 4, building on measure-theoretic probability theory available in the Mathlib library. Our development provides a mechanically-checked pipeline from the definitions of empirical and expected Rademacher complexity, through a formal symmetrization argument and a bounded-differences analysis, to high-probability uniform deviation bounds via a formally proved McDiarmid inequality. A key technical contribution is a reusable mechanism for lifting results from countable hypothesis classes (where measurability of suprema is straightforward in Mathlib) to separable topological index sets via a reduction to a countable dense subset. As worked applications of the abstract theorem, we mechanize standard empirical Rademacher bounds for linear predictors under $\ell_2$ and $\ell_1$ regularization, and we also formalize a Dudley-type entropy integral bound based on covering numbers and a chaining construction.
翻译:理解和验证机器学习算法的泛化性能——即从有限训练样本中获得测试误差的理论估计——是统计学习理论的核心主题。在众多用于推导此类保证的复杂度度量中,Rademacher复杂度能够产生尖锐的、数据依赖的界,其适用范围远超经典的0-1分类问题。在本研究中,我们基于Mathlib库中可用的测度论概率理论,在Lean 4中形式化了由Rademacher复杂度导出的泛化误差界。我们的开发提供了一个经过机械验证的流程:从经验Rademacher复杂度和期望Rademacher复杂度的定义出发,通过一个形式化的对称化论证和有界差分分析,最终借助一个经过形式化证明的McDiarmid不等式,得到高概率一致偏差界。一个关键的技术贡献是提出了一种可复用的机制,用于将结果从可数假设类(在Mathlib中其上确界的可测性是直接的)通过约简到可数稠密子集的方式,提升到可分离拓扑指标集。作为抽象定理的具体应用,我们机械化地推导了在$\ell_2$和$\ell_1$正则化下线性预测器的标准经验Rademacher界,并且还形式化了一个基于覆盖数和链式构造的Dudley型熵积分界。