Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm known for its stability and sample efficiency, but it often suffers from premature convergence due to limited exploration. In this paper, we propose POEM (Proximal Policy Optimization with Evolutionary Mutations), a novel modification to PPO that introduces an adaptive exploration mechanism inspired by evolutionary algorithms. POEM enhances policy diversity by monitoring the Kullback-Leibler (KL) divergence between the current policy and a moving average of previous policies. When policy changes become minimal, indicating stagnation, POEM triggers an adaptive mutation of the policy parameters to promote exploration. We evaluate POEM on four OpenAI Gym environments: CarRacing, MountainCar, BipedalWalker, and LunarLander. After extensive hyperparameter tuning via Bayesian optimization, and using Welch's t-test for statistical comparison, we find that POEM significantly outperforms PPO on three of the four tasks (BipedalWalker: t=-2.0642, p=0.0495; CarRacing: t=-6.3987, p=0.0002; MountainCar: t=-6.2431, p<0.0001), while the difference on LunarLander is not statistically significant (t=-1.8707, p=0.0778). Our results highlight the potential of integrating evolutionary principles into policy gradient methods to better navigate the exploration-exploitation tradeoff.
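The stagnation-triggered mutation described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class name `MutationTrigger`, the exponential-moving-average update, and the default threshold and noise values are all assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_categorical(p, q):
    """Mean KL(p || q) over rows of categorical action distributions."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

class MutationTrigger:
    """Hypothetical sketch of POEM's stagnation check.

    Tracks a moving average of past policy action distributions; when the
    KL divergence between the current policy and that average falls below
    a threshold (stagnation), the policy parameters are perturbed with
    Gaussian noise to restore exploration. Threshold, EMA decay, and
    noise scale are illustrative defaults, not values from the paper.
    """

    def __init__(self, kl_threshold=1e-3, ema_decay=0.9, noise_std=0.02):
        self.kl_threshold = kl_threshold
        self.ema_decay = ema_decay
        self.noise_std = noise_std
        self.avg_probs = None  # moving average of previous policies

    def step(self, probs, params):
        """probs: (batch, actions) action distribution of current policy.
        Returns (possibly mutated params, whether a mutation fired)."""
        if self.avg_probs is None:
            self.avg_probs = probs.copy()
            return params, False
        kl = kl_categorical(probs, self.avg_probs)
        # update the moving average of past policies
        self.avg_probs = (self.ema_decay * self.avg_probs
                          + (1 - self.ema_decay) * probs)
        self.avg_probs /= self.avg_probs.sum(axis=-1, keepdims=True)
        if kl < self.kl_threshold:
            # adaptive mutation: Gaussian perturbation of the parameters
            mutated = params + rng.normal(0.0, self.noise_std, params.shape)
            return mutated, True
        return params, False
```

In a training loop, `step` would be called once per PPO update with the policy's action probabilities on a batch of states; nearly identical consecutive distributions yield a near-zero KL and trigger the mutation.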