Decoupled PPO is a successful reinforcement learning (RL) algorithm for handling the high data staleness of the asynchronous RL setting. Its decoupled loss improves the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy-update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, adding computational overhead to large language model training. We observe that since the proximal policy serves only as a trust-region anchor between the behavior and target policies, it can be approximated by simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates the overhead, yielding a 1.8x training speedup while maintaining comparable performance. Code & an off-the-shelf example are available at: https://github.com/inclusionAI/AReaL/blob/main/docs/algorithms/prox_approx.md
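The idea of replacing the proximal policy's forward pass with interpolation can be sketched per token as follows. This is a minimal illustration, not the paper's implementation: the function names, the linear interpolation in log-probability space, and the coefficient `alpha` are all assumptions made here for clarity; the exact interpolation scheme and loss details are in the linked code.

```python
import math

def approx_proximal_logprob(behav_lp: float, target_lp: float,
                            alpha: float = 0.5) -> float:
    """Approximate the proximal policy's log-prob by interpolating between
    the behavior and target policies' log-probs, avoiding the extra forward
    pass. (Linear interpolation with `alpha` is an assumption here.)"""
    return alpha * behav_lp + (1.0 - alpha) * target_lp

def decoupled_ppo_token_loss(target_lp: float, behav_lp: float,
                             advantage: float, clip_eps: float = 0.2,
                             alpha: float = 0.5) -> float:
    """Per-token decoupled PPO loss with the interpolated proximal policy."""
    prox_lp = approx_proximal_logprob(behav_lp, target_lp, alpha)
    # Off-policy correction: importance weight of proximal vs. behavior policy.
    behav_weight = math.exp(prox_lp - behav_lp)
    # Trust region: clipped ratio of target vs. proximal policy.
    ratio = math.exp(target_lp - prox_lp)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return -behav_weight * min(ratio * advantage, clipped * advantage)
```

When all three policies coincide (equal log-probs), both the importance weight and the clipped ratio reduce to 1, so the loss degenerates to the negated advantage, matching standard PPO at the start of an update.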