Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce a Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using an EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, EMA-PG allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench, compared to 50.8% for GRPO. On agentic RL domains, with a Qwen-3B base, EMA-PG improves over GRPO by an average of 33.3% across 7 question-answering datasets with search engines, including 29.7% $\rightarrow$ 44.1% on HotpotQA and 27.4% $\rightarrow$ 40.1% on 2WikiMultiHopQA. Overall, we show that EMA-PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema-pg
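The EMA anchor described above can be illustrated with a minimal sketch: instead of keeping the anchor (reference) policy frozen, its parameters are smoothly updated toward the current policy after each training step, analogous to a target network in deep Q-learning. The helper below is purely illustrative (the function name, decay value, and parameter representation are assumptions, not the paper's actual implementation).

```python
# Illustrative EMA anchor-policy update (hypothetical helper, not the
# paper's implementation). Parameters are represented as flat lists of
# floats for simplicity; a real LLM would update tensors in place.

def ema_update(anchor_params, policy_params, decay=0.99):
    """Move each anchor parameter toward the current policy parameter:
    anchor <- decay * anchor + (1 - decay) * policy."""
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(anchor_params, policy_params)]

# Toy example: after one update with decay=0.9, each anchor entry moves
# 10% of the way toward the corresponding policy entry.
anchor = [0.0, 1.0]
policy = [1.0, 3.0]
anchor = ema_update(anchor, policy, decay=0.9)
```

With decay close to 1, the anchor changes slowly, which is what provides the stability benefit the abstract refers to; decay = 1 recovers a fixed anchor, and decay = 0 makes the anchor track the policy exactly.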