Despite its extreme sample inefficiency, on-policy reinforcement learning (RL) has become a fundamental tool in real-world applications. With recent advances in GPU-driven simulation, the ability to collect vast amounts of data for RL training has scaled exponentially. However, studies show that current on-policy methods, such as Proximal Policy Optimization (PPO), fail to fully exploit the benefits of parallelized environments and saturate in performance beyond a certain scale. In contrast, Evolutionary Algorithms (EAs) excel at increasing diversity through randomization, making them a natural complement to RL. However, existing evolutionary reinforcement learning (EvoRL) methods have struggled to gain widespread adoption due to their extreme sample inefficiency. To address these challenges, we introduce Evolutionary Policy Optimization (EPO), a novel policy gradient algorithm that combines the strengths of EAs and policy gradients. We show that EPO significantly improves performance across diverse and challenging environments and demonstrates superior scalability with parallelized simulations.
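The abstract describes EPO only at a high level, so the following is a minimal, hypothetical sketch of the general pattern it alludes to: an outer evolutionary loop over a population of policies, each refined by inner policy-gradient updates on batches of parallel rollouts. All names here (ToyEnv, PolicyAgent, evolve_population) and the REINFORCE-style update are illustrative assumptions, not the paper's actual EPO algorithm.

```python
# Hedged sketch of combining an evolutionary outer loop with policy-gradient inner updates.
# Hypothetical toy setup; not the EPO method from the paper.
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """One-step 'environment': reward is higher when the action matches a hidden target."""
    def __init__(self, dim=4):
        self.target = rng.normal(size=dim)
    def rollout(self, action):
        return -np.sum((action - self.target) ** 2)  # negative squared error as reward

class PolicyAgent:
    """Gaussian policy with a learnable mean, updated by a REINFORCE-style gradient."""
    def __init__(self, dim=4, sigma=0.1, lr=0.05):
        self.mean = rng.normal(scale=0.5, size=dim)
        self.sigma, self.lr = sigma, lr
    def act(self):
        return self.mean + self.sigma * rng.normal(size=self.mean.shape)
    def pg_update(self, actions, rewards):
        # grad log N(a; mean, sigma^2) = (a - mean) / sigma^2, weighted by a baselined return
        adv = rewards - rewards.mean()
        grad = ((actions - self.mean) / self.sigma**2 * adv[:, None]).mean(axis=0)
        self.mean += self.lr * grad

def evolve_population(pop, fitness, elite_frac=0.5, mut_std=0.05):
    """Keep the elite half; refill by mutating elites (a stand-in for crossover/mutation)."""
    order = np.argsort(fitness)[::-1]
    elites = [pop[i] for i in order[: max(1, int(len(pop) * elite_frac))]]
    new_pop = list(elites)
    while len(new_pop) < len(pop):
        parent = elites[rng.integers(len(elites))]
        child = PolicyAgent(dim=parent.mean.size, sigma=parent.sigma, lr=parent.lr)
        child.mean = parent.mean + mut_std * rng.normal(size=parent.mean.shape)
        new_pop.append(child)
    return new_pop

env = ToyEnv()
population = [PolicyAgent() for _ in range(8)]
for gen in range(50):
    fitness = []
    for agent in population:
        actions = np.stack([agent.act() for _ in range(32)])        # "parallel" rollouts
        rewards = np.array([env.rollout(a) for a in actions])
        agent.pg_update(actions, rewards)                           # inner policy-gradient step
        fitness.append(rewards.mean())
    population = evolve_population(population, np.array(fitness))   # outer evolutionary step
print("best mean reward in final generation:", max(float(f) for f in fitness))
```

The intended takeaway is the division of labor suggested by the abstract: the gradient step exploits rollout data sample-efficiently, while selection and mutation inject the diversity that plain PPO-style training tends to lose as simulation parallelism grows.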