Proximal Policy Optimization (PPO) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent takes compound actions, each consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it uses samples inefficiently. For compound actions, most PPO implementations work with the joint probability (density) of the sub-actions, which means that if the probability ratio of a sample (a state and compound-action pair) falls outside the clip range, the gradient that sample produces is zero. Instead, we calculate the loss separately for each sub-action; the separate ratios are less prone to clipping during updates, so samples are put to better use. Further, we propose a multi-action mixed loss that combines joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50\% across different MuJoCo environments compared to OpenAI's PPO benchmark results. In Gym-$\mu$RTS, we find that the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest that this method can better balance how efficiently samples are used against the quality of those samples.
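The difference between clipping the joint ratio and clipping each sub-action ratio can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the choice to sum the per-sub-action terms, and the convex-combination form of the mixed objective (with a hypothetical `weight` parameter) are all assumptions made for illustration, since the abstract does not specify them. Sub-actions are assumed to be conditionally independent, so the joint log-probability is the sum of the sub-action log-probabilities.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Elementwise PPO clipped surrogate (a quantity to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)


def is_clipped(ratio, advantage, clip_eps=0.2):
    """True where the surrogate is flat in the ratio, i.e. the gradient is zero."""
    too_high = (advantage > 0) & (ratio > 1.0 + clip_eps)
    too_low = (advantage < 0) & (ratio < 1.0 - clip_eps)
    return too_high | too_low


def joint_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clip one ratio per sample, built from the joint probability.

    logp_new, logp_old: arrays of shape (batch, n_sub) holding the
    log-probability of each sub-action; advantage: shape (batch,).
    """
    joint_ratio = np.exp((logp_new - logp_old).sum(axis=1))
    return clipped_surrogate(joint_ratio, advantage, clip_eps).mean()


def sub_action_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clip one ratio per sub-action, then sum over sub-actions (assumption)."""
    sub_ratios = np.exp(logp_new - logp_old)
    per_sub = clipped_surrogate(sub_ratios, advantage[:, None], clip_eps)
    return per_sub.sum(axis=1).mean()


def mixed_objective(logp_new, logp_old, advantage, clip_eps=0.2, weight=0.5):
    """Hypothetical mix: a convex combination of the two objectives above."""
    joint = joint_objective(logp_new, logp_old, advantage, clip_eps)
    separate = sub_action_objective(logp_new, logp_old, advantage, clip_eps)
    return weight * joint + (1.0 - weight) * separate


# One sample with three sub-actions, each of which became 10% more likely.
logp_old = np.log(np.array([[0.5, 0.5, 0.5]]))
logp_new = np.log(np.array([[0.55, 0.55, 0.55]]))
advantage = np.array([1.0])

sub_ratios = np.exp(logp_new - logp_old)   # each about 1.1
joint_ratio = sub_ratios.prod(axis=1)      # about 1.1 ** 3 = 1.331

print("joint ratio clipped:", is_clipped(joint_ratio, advantage))
print("sub-action ratios clipped:", is_clipped(sub_ratios, advantage[:, None]))
```

With a clip range of 0.2, the joint ratio of about 1.331 exceeds the upper bound, so the whole sample contributes no gradient, while each sub-action ratio of about 1.1 stays inside the range and still contributes. This is the effect the abstract describes; how the per-sub-action terms are actually aggregated and mixed should be taken from the paper itself.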