While reinforcement learning has achieved impressive progress in language model reasoning, it remains constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by using the probability that the LLM assigns to the reference answer as the reward signal. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose \textit{\textbf{Co}upled \textbf{V}ariational \textbf{R}einforcement \textbf{L}earning} (CoVRL), which bridges variational inference and reinforcement learning by coupling the prior and posterior distributions over reasoning traces through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4\% over the base model and achieves an additional 2.3\% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
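To make the setting concrete, the following is a minimal formal sketch of the kind of objective the abstract describes; the symbols $\lambda$, $q_\phi$, $r$, and $\beta$ are illustrative assumptions rather than notation taken from the paper. A sampled reasoning trace $z$ for question $x$ is scored by the likelihood of the reference answer $a^\star$, traces are drawn from a composite distribution mixing the question-only prior with an answer-conditioned posterior, and a KL regularizer of the usual variational form is included for illustration:
\begin{align}
r(z) &= \log p_\theta\!\left(a^\star \mid x, z\right), \\
\tilde{q}(z) &= \lambda\, p_\theta(z \mid x) + (1-\lambda)\, q_\phi(z \mid x, a^\star), \qquad \lambda \in [0,1], \\
\mathcal{J}(\theta) &= \mathbb{E}_{z \sim \tilde{q}}\!\left[r(z)\right] - \beta\, \mathrm{KL}\!\left(q_\phi(z \mid x, a^\star) \,\Vert\, p_\theta(z \mid x)\right).
\end{align}
Under this reading, the posterior component of $\tilde{q}$ steers exploration toward answer-consistent traces while the prior component keeps sampling close to the policy; the paper's actual composite distribution and estimator may differ from this sketch.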