Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy discards entire batches in which no valid proof is found, contributing nothing to model updates and wasting significant data. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, which decouples the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, which applies extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.
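The equal-right advantage mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the min-max normalization of auxiliary rewards, and the `[0.5, 1.0]` magnitude range are all assumptions introduced here for clarity. The key invariant it demonstrates is that the binary verifier outcome alone fixes the sign of each sample's advantage, while auxiliary rewards only scale its magnitude and can never flip a valid proof into a negative update (the conflict GRPO's group-relative baseline can produce under composite rewards).

```python
def equal_right_advantage(verified, aux_rewards):
    """Illustrative sketch of an 'equal-right' advantage.

    verified:    list[bool]  - binary formal-verifier outcome per sample.
    aux_rewards: list[float] - auxiliary reward per sample (e.g. a
                 format or brevity score); the normalization below is
                 an assumption, not the paper's exact scheme.
    """
    lo, hi = min(aux_rewards), max(aux_rewards)
    span = (hi - lo) or 1.0  # avoid division by zero when all equal
    advantages = []
    for ok, r in zip(verified, aux_rewards):
        # Magnitude in [0.5, 1.0]: the auxiliary reward modulates
        # step size but cannot change the sign of the update.
        mag = 0.5 + 0.5 * (r - lo) / span
        advantages.append(mag if ok else -mag)
    return advantages
```

For example, a verified proof with a low auxiliary score still receives a (smaller) positive advantage, whereas under a naive composite reward its group-relative advantage could turn negative and penalize a correct proof.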