While off-policy reinforcement learning (RL) algorithms are sample-efficient thanks to gradient-based updates and the reuse of data in a replay buffer, they tend to converge to local optima because of limited exploration. Population-based algorithms, on the other hand, offer a natural exploration strategy, but their heuristic black-box operators are inefficient. Recent algorithms have integrated the two approaches, connecting them through a shared replay buffer. However, the effect of the diverse data generated during population optimization iterations on off-policy RL algorithms has not been thoroughly investigated. In this paper, we first analyze the use of off-policy RL algorithms in combination with population-based algorithms and show that population data can introduce an overlooked error and harm performance. To test this, we propose a uniform and scalable training design and conduct experiments with our tailored framework on robot locomotion tasks from OpenAI Gym. Our results substantiate that using population data in off-policy RL can cause instability during training and even degrade performance. To remedy this issue, we further propose a double replay buffer design that provides more on-policy data, and we show its effectiveness through experiments. Our results offer practical insights for training these hybrid methods.
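To make the double replay buffer idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that transitions collected by the RL agent and by the population are stored separately, and that each training batch draws a fixed share from the agent's own (closer to on-policy) data. The class name `DoubleReplayBuffer`, the `rl_fraction` parameter, and the fallback rule when a buffer is short are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import random
from collections import deque


class DoubleReplayBuffer:
    """Two FIFO buffers: one for the RL agent's rollouts, one for population rollouts.

    Assumption (not from the paper): a batch mixes both sources with a
    fixed ratio given by `rl_fraction`.
    """

    def __init__(self, capacity, rl_fraction=0.5, seed=None):
        if not 0.0 <= rl_fraction <= 1.0:
            raise ValueError("rl_fraction must be in [0, 1]")
        self.rl_buffer = deque(maxlen=capacity)   # data from the RL agent itself
        self.pop_buffer = deque(maxlen=capacity)  # data from population members
        self.rl_fraction = rl_fraction
        self._rng = random.Random(seed)

    def add(self, transition, from_rl_agent):
        """Store a transition in the buffer that matches its source."""
        target = self.rl_buffer if from_rl_agent else self.pop_buffer
        target.append(transition)

    def sample(self, batch_size):
        """Draw up to `batch_size` transitions without replacement.

        If the RL buffer holds fewer items than its share, the remainder is
        filled from the population buffer when possible, so the batch can
        still be smaller than `batch_size` early in training.
        """
        n_rl = min(round(batch_size * self.rl_fraction), len(self.rl_buffer))
        n_pop = min(batch_size - n_rl, len(self.pop_buffer))
        batch = self._rng.sample(list(self.rl_buffer), n_rl)
        batch += self._rng.sample(list(self.pop_buffer), n_pop)
        self._rng.shuffle(batch)
        return batch


# Usage: placeholder tuples stand in for (state, action, reward, next_state, done).
buffer = DoubleReplayBuffer(capacity=1000, rl_fraction=0.75, seed=0)
for step in range(20):
    buffer.add(("rl", step), from_rl_agent=True)
    buffer.add(("pop", step), from_rl_agent=False)
batch = buffer.sample(8)
```

Under these assumptions, raising `rl_fraction` shifts each update toward data generated by the current RL policy, while a value of 0.5 with equally filled buffers approximates drawing evenly from a single shared buffer.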