In reinforcement learning (RL), given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\,\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient with respect to the sampled-completion logits is $p^{\theta} - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
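To make the target construction and the $p^{\theta} - q$ gradient concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the policy over a group is a softmax over one logit per sampled completion, takes the scores $u_i$ as given (the abstract does not say how they are derived from rewards), and stands in for a parameter update with plain gradient descent directly on those logits. The function names and the example numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tpo_target(p_old, u):
    """Target q_i proportional to p_old_i * exp(u_i), normalized over the group."""
    log_weights = [math.log(p) + ui for p, ui in zip(p_old, u)]
    return softmax(log_weights)


def cross_entropy(q, logits):
    """Cross-entropy H(q, p_theta), with p_theta = softmax(logits)."""
    p = softmax(logits)
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))


def grad_logits(q, logits):
    """Gradient of the cross-entropy w.r.t. the logits: p_theta - q."""
    p = softmax(logits)
    return [pi - qi for pi, qi in zip(p, q)]


# Hypothetical group of four sampled completions.
p_old = [0.25, 0.25, 0.25, 0.25]   # old-policy probabilities within the group
u = [1.0, 0.0, 0.0, -1.0]          # assumed scores; higher means better
q = tpo_target(p_old, u)

# Start the policy at the old policy and fit it to q.
# Assumption: gradient descent on the logits replaces a real optimizer step.
logits = [math.log(p) for p in p_old]
for _ in range(2000):
    g = grad_logits(q, logits)
    logits = [z - 0.5 * gi for z, gi in zip(logits, g)]

p_fit = softmax(logits)
g_final = grad_logits(q, logits)
```

Because the target `q` is fixed before fitting, the optimizer settings affect only how quickly `p_fit` approaches `q`, not where it ends up: the gradient shrinks to zero as the two distributions coincide.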