Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and, if further advanced, could impact scientific research. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have progressed within a year from poor performance to saturating quantitative-reasoning competitions such as AIME and HMMT. However, this approach faces fundamental limitations. Pursuing higher final-answer accuracy does not address a key issue: a correct answer does not guarantee correct reasoning. Moreover, many mathematical tasks, such as theorem proving, require rigorous step-by-step derivations rather than numerical answers, making final-answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in its own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose scaling verification compute to automatically label new hard-to-verify proofs, creating training data that further improves the verifier. The resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
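To make the pipeline concrete, below is a minimal Python sketch, under assumed interfaces, of the loop the abstract describes: the verifier acts as the reward model for the generator, the generator checks and revises its own drafts before finalizing, and extra verification compute is spent to label hard-to-verify proofs as new verifier training data. All names (`generate_proof`, `verify_proof`, `self_refine`, `label_hard_proofs`) are hypothetical stand-ins, not the paper's implementation.

```python
# Hedged sketch of the verifier-as-reward training signal described in the abstract.
# generate_proof / verify_proof are hypothetical stand-ins for the LLM generator
# and the LLM-based proof verifier; real systems would call models, not stubs.

from dataclasses import dataclass
import random


@dataclass
class ProofAttempt:
    problem: str
    proof: str
    score: float  # verifier-assigned rigor score in [0, 1]


def generate_proof(problem: str) -> str:
    """Hypothetical stand-in for the LLM proof generator."""
    return f"candidate proof for: {problem}"


def verify_proof(problem: str, proof: str) -> float:
    """Hypothetical stand-in for the LLM verifier; returns a rigor score."""
    return random.random()


def self_refine(problem: str, n_rounds: int = 3) -> ProofAttempt:
    """Draft, self-check with the verifier, and keep the best attempt --
    mirroring 'identify and resolve issues before finalizing' a proof."""
    best = ProofAttempt(problem, "", -1.0)
    for _ in range(n_rounds):
        proof = generate_proof(problem)
        score = verify_proof(problem, proof)  # verifier used as the reward model
        if score > best.score:
            best = ProofAttempt(problem, proof, score)
    return best


def label_hard_proofs(attempts, n_votes: int = 8, threshold: float = 0.9):
    """Scaled verification compute (here, repeated scoring) to auto-label
    hard-to-verify proofs, yielding data for further verifier training."""
    labeled = []
    for a in attempts:
        votes = [verify_proof(a.problem, a.proof) for _ in range(n_votes)]
        labeled.append((a, sum(votes) / n_votes >= threshold))
    return labeled


if __name__ == "__main__":
    attempts = [self_refine(p) for p in ["problem A", "problem B"]]
    for attempt, is_sound in label_hard_proofs(attempts):
        print(attempt.problem, round(attempt.score, 2), is_sound)
```

In this sketch, repeated verifier scoring is a placeholder for whatever scaled-verification procedure the paper actually uses; the point is only the data flow between generator, verifier-as-reward, and verifier retraining.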