We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at https://github.com/YuanheZ/lean-stat-learning-theory
翻译:我们首次在Lean 4中基于实证过程理论,对统计学习理论进行了全面的形式化。我们的端到端形式化基础设施实现了最新Lean 4 Mathlib库中缺失的内容,包括高斯Lipschitz集中性的完整推导、子高斯过程的Dudley熵积分定理的首次形式化,以及一个具有尖锐速率的(稀疏)最小二乘回归的应用。该项目采用人机协作的工作流程完成,其中人类设计证明策略,AI代理执行战术性证明构造,最终形成了经过人工验证的、用于统计学习理论的Lean 4工具箱。除了实现之外,形式化过程还揭示并解决了标准统计学习理论教科书中隐含的假设和缺失的细节,从而强制了对理论进行逐行细粒度的理解。这项工作建立了一个可重用的形式化基础,并为机器学习理论的未来发展打开了大门。代码可在 https://github.com/YuanheZ/lean-stat-learning-theory 获取。