Reward-maximizing RL methods enhance the reasoning performance of LLMs but often reduce the diversity of their outputs. Recent work addresses this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior work, which treats the partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal and leverage this previously unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that uses these accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through a replay mechanism prioritized by accuracy-estimation error. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training of LLMs.
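To make the alluded-to relationship concrete, the following is a minimal sketch under an assumed reward form; the specific factorization of $R$ into a reference policy $\pi_{\mathrm{ref}}$ and a binary correctness indicator $r$ is illustrative, not taken from the paper. Suppose the per-prompt target distribution is $p^{*}(y \mid x) \propto R(x, y)$ with $R(x, y) = \pi_{\mathrm{ref}}(y \mid x)\, r(x, y)$, where $r(x, y) \in \{0, 1\}$ indicates a correct answer. Then the per-prompt partition function is
\[
  Z(x) \;=\; \sum_{y} R(x, y)
       \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, r(x, y)
       \;=\; \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\bigl[ r(x, y) \bigr],
\]
i.e., exactly the expected reward (accuracy) on prompt $x$. Under this reading, the jointly learned estimate $\log Z_{\theta}(x)$ doubles as a per-prompt accuracy estimate that is available at no extra sampling cost, which is the quantity the prompt-prioritization and replay components reuse.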