While Large Language Models (LLMs) have demonstrated strong mathematical reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR), many advanced mathematical problems are proof-based, and the validity of a proof cannot be established by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "**question-proof-check**" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, which we then filter through hierarchical human review to ensure label alignment. Using these data, we train a proof-checking RM, incorporating an additional process reward and token-weight balancing to stabilize RL training. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability, and test-time guidance, providing practical recipes and tools for strengthening the mathematical capabilities of LLMs.
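As a rough illustration only, the sketch below shows one way the "question-proof-check" triplet described above could be represented, together with a toy blend of an outcome reward and a per-step process reward. The schema, field names, and the `combined_reward` weighting are hypothetical assumptions, not the paper's actual data format or reward definition:

```python
# Hypothetical sketch of a "question-proof-check" triplet record and a
# blended outcome/process reward; all names and the weighting scheme are
# illustrative, not the paper's actual schema.
from dataclasses import dataclass
from typing import List

@dataclass
class ProofCheckTriplet:
    question: str           # the proof-based problem statement
    proof: str              # a candidate proof generated by an LLM
    step_labels: List[int]  # per-step correctness labels (1 = valid, 0 = flawed)
    overall_label: int      # final verdict after hierarchical human review

def combined_reward(t: ProofCheckTriplet, process_weight: float = 0.5) -> float:
    """Blend the overall verdict with an average per-step process reward,
    a stand-in for the process-reward signal mentioned in the abstract."""
    if not t.step_labels:
        return float(t.overall_label)
    process = sum(t.step_labels) / len(t.step_labels)
    return (1 - process_weight) * t.overall_label + process_weight * process

example = ProofCheckTriplet(
    question="Prove that the sum of two even integers is even.",
    proof="Let a = 2m and b = 2n. Then a + b = 2(m + n), which is even.",
    step_labels=[1, 1],
    overall_label=1,
)
print(combined_reward(example))  # -> 1.0
```

The per-step labels are what make a process reward possible: unlike answer matching, a flawed intermediate step can lower the reward even when the final verdict happens to be correct.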