The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency but do not directly contribute to it. In the meantime, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow the Mathlib's conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLM models (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLM models and LLM agents struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge-ready Mathlib contributions.
翻译:Lean与Mathlib生态系统已成为大语言模型辅助形式化推理的事实标准,近年来取得了显著成功。然而,这些成功仅将Mathlib作为关键依赖项使用,并未直接为其做出贡献。与此同时,Mathlib的增长近期受限于评审流程:该流程需要人工评审者判断拉取请求是否符合Mathlib的规范,以及是否值得整合到共享数学基础设施中。这引出了我们的核心问题:大语言模型能否协助评审Mathlib的拉取请求?为此,我们提出了MathlibPR——一个基于真实Mathlib4拉取请求历史构建的基准。我们进一步设计了分阶段评估协议,并以此评估了多款LLM模型(如DeepSeek、Qwen、Goedel、Kimina)与LLM智能体(如Codex、Claude Code)。令人惊讶的是,无论是LLM模型还是LLM智能体,均难以区分合并就绪的拉取请求与已修订或未被合并的构建通过型拉取请求。通过将Mathlib的拉取请求历史转化为监督信号,MathlibPR为开发能够评估拉取请求的评审助手和奖励模型迈出了关键一步,从而引导LLM生成合并就绪的Mathlib贡献。