Despite extreme sample inefficiency, on-policy reinforcement learning, also known as policy gradients, has become a fundamental tool in decision-making problems. With recent advances in GPU-driven simulation, the ability to collect large amounts of data for RL training has scaled exponentially. However, we show that current RL methods, e.g., PPO, fail to reap the benefit of parallelized environments beyond a certain point, and their performance saturates. To address this, we propose a new on-policy RL algorithm that can effectively leverage large-scale environments by splitting them into chunks and fusing them back together via importance sampling. Our algorithm, termed SAPG, shows significantly higher performance across a variety of challenging environments where vanilla PPO and other strong baselines fail to achieve high performance. Website at https://sapg-rl.github.io/
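The core idea of fusing data from separate environment chunks via importance sampling can be illustrated with a minimal sketch. This is not the paper's implementation; it shows a hypothetical PPO-style clipped surrogate in which transitions collected by a chunk's behavior policy are reweighted by the ratio of the current policy's log-probabilities to the behavior policy's, so off-policy chunks can contribute to the main policy's update. The function name and signature are assumptions for illustration.

```python
import numpy as np

def is_weighted_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Hypothetical clipped surrogate with importance sampling.

    logp_new:   log-probs of the actions under the policy being updated
    logp_old:   log-probs under the (chunk's) behavior policy that
                collected the data
    advantages: estimated advantages for each transition
    """
    ratio = np.exp(logp_new - logp_old)            # importance weights
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO-style pessimistic bound: take the elementwise minimum
    return np.minimum(unclipped, clipped).mean()
```

When the data is on-policy (`logp_new == logp_old`), the ratio is 1 and this reduces to the mean advantage, as expected; for off-policy chunks the clipping bounds the variance of the importance weights.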