While Large Language Models (LLMs) have demonstrated strong mathematical reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR), many advanced mathematical problems are proof-based, and the correctness of a proof cannot be established by simple answer matching. To enable automatic verification, a Reward Model (RM) that can reliably evaluate full proof processes is required. In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "**question-proof-check**" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, which are subsequently filtered through hierarchical human review for label alignment. Using these data, we train a proof-checking RM, incorporating an "LLM-as-a-RM-for-RM" approach and balanced token weighting to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability, and test-time guidance, providing practical recipes and tools for strengthening the mathematical capabilities of LLMs.