Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5\% success on Busy Beaver problems, it manages only 12\% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
翻译:形式化定理证明(FTP)已成为评估大语言模型推理能力的关键基础,实现了数学证明的规模化自动验证。然而,由于人工标注成本高昂,且具有已验证形式化-非形式化对应关系的挑战性问题稀缺,该领域的进展受到有限数据集的制约。我们提出将理论计算机科学(TCS)作为严谨证明问题的可规模化来源,其中算法定义能够自动生成任意数量的挑战性定理-证明对。我们在两个TCS领域展示了该方法:忙碌海狸问题(涉及图灵机停机行为的界限证明)和混合布尔算术问题(结合逻辑与算术推理)。我们的框架自动合成具有并行形式化(Lean4)与非形式化(Markdown)规范的问题,构建了生成已验证证明挑战的可规模化流水线。对前沿模型的评估揭示了自动化定理证明中的显著差距:DeepSeekProver-V2-671B在忙碌海狸问题上达到57.5%的成功率,但在混合布尔算术问题上仅完成12%。这些结果凸显了长形式证明生成的难度——即便对于计算上易于验证的问题也是如此,从而证明了TCS领域在推进自动化推理研究中的价值。