Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.
翻译:大型语言模型在数学问题求解方面能力日益增强,甚至能辅助研究级证明,但我们仍缺乏一种可扩展且可复现的方法,来衡量来自不同来源的长篇证明中的逐步推理能力。这一评估差距限制了在经证明认证的科学进展中实现可信赖的人工智能辅助。现有评估常侧重于最终答案或依赖昂贵的专家评分,而端到端的证明生成仍具有开放性且难以自动验证。我们提出Mask-Proof,一种将真实证明转化为可自动检查的掩码步骤任务的流水线。它掩码关键公式步骤,提供必要的上下文,并利用基于大语言模型的等效性评判器通过重复投票进行稳定性评估,从而重构模型。由此生成的Mask-ProofBench包含来自不同研究领域的292个精选问题。对17个模型的实验表明,增强推理能力的模型在性能上比标准模型高出12%至27%。我们的评判器与专家标注者的一致性达到96.8%,从而实现了对逐步数学推理进行忠实、可复现且可比较的测量。基准测试、标注及代码可在https://github.com/weating/Mask-Proof获取。