Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, both inherent to RL-based approaches. To address these challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized with a cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while providing more stable and efficient training. Extensive experiments demonstrate that PACS significantly outperforms strong open-source models and RLVR baselines, yielding substantial average gains of $\textbf{+8.26\%}$ (4B) and $\textbf{+9.57\%}$ (8B) over the base models, offering a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
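As a rough illustration of this reformulation (the particular score function $s_\theta$ below, e.g. a logit derived from the policy's sequence log-probability, is an assumption for exposition and not necessarily the paper's exact definition), the verifiable outcome $r(x,y)\in\{0,1\}$ can be treated as a binary label and fit with a cross-entropy loss:
$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,y)}\!\left[\, r(x,y)\,\log \sigma\!\big(s_\theta(x,y)\big) + \big(1-r(x,y)\big)\,\log\!\big(1-\sigma(s_\theta(x,y))\big) \right],
\qquad
\nabla_\theta \mathcal{L}(\theta) = -\,\mathbb{E}_{(x,y)}\!\left[ \big(r(x,y)-\sigma(s_\theta(x,y))\big)\,\nabla_\theta s_\theta(x,y) \right].
$$
In this sketch, when $s_\theta$ is built from the policy's log-probability of the sampled response, $\nabla_\theta s_\theta$ reduces to the score-function term of the classical policy gradient, with the residual $r - \sigma(s_\theta)$ acting as an implicit, critic-like advantage weight, which is consistent with the abstract's claim that the supervised formulation recovers the policy gradient update.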