LLMs excel at reasoning, but validating their steps remains challenging. Formal verification offers a solution through mechanically checkable proofs. Interactive theorem provers (ITPs) dominate mathematical reasoning but require detailed low-level proof steps, while auto-active verifiers offer automation but focus on software verification. Recent work has begun bridging this divide by evaluating LLMs for software verification in ITPs, but the complementary direction--LLMs for mathematical theorem proving in auto-active verifiers--remains unexplored. We present MINIF2F-DAFNY, the first translation of the widely-used mathematical benchmark miniF2F to an auto-active verifier: Dafny. We find that Dafny's automation alone solves 39-44% of problems with empty proofs, whereas many require substantial proof guidance in ITPs. For remaining problems, we evaluate 7 off-the-shelf LLMs, achieving 55.7% success with the best model (Claude Sonnet 4.5) using modest resources. These results demonstrate effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .
翻译:大型语言模型(LLM)在推理方面表现出色,但验证其推理步骤仍具挑战性。形式化验证通过提供机器可检查的证明提供了一种解决方案。交互式定理证明器(ITP)在数学推理领域占主导地位,但需要详细、低层次的证明步骤;而自动主动验证器虽能提供自动化支持,但其主要关注软件验证。近期研究开始通过评估LLM在ITP中进行软件验证来弥合这一鸿沟,但相反方向——利用LLM在自动主动验证器中进行数学定理证明——仍未得到探索。我们提出了MINIF2F-DAFNY,这是首次将广泛使用的数学基准测试miniF2F翻译到自动主动验证器Dafny中。我们发现,仅凭Dafny的自动化能力,在无需提供证明内容的情况下即可解决39-44%的问题,而这些问题中的许多在ITP中需要大量的证明指导。对于剩余问题,我们评估了7个现成的LLM,在使用适度资源的情况下,最佳模型(Claude Sonnet 4.5)取得了55.7%的成功率。这些结果展示了有效的分工:LLM提供高层次指导,而自动化处理低层次细节。我们的基准测试可在GitHub上找到:http://github.com/dafny-lang/miniF2F。