Q-learning played a foundational role in the field of reinforcement learning (RL). However, when combined with off-policy data and nonlinear function approximation such as deep neural networks, TD algorithms like Q-learning require several additional tricks to stabilise training, chiefly a replay buffer and target networks. Unfortunately, the target network's delayed updating of frozen parameters harms sample efficiency, and, similarly, the replay buffer introduces memory and implementation overheads. In this paper, we investigate whether TD training can be accelerated and simplified while maintaining its stability. Our key theoretical result demonstrates, for the first time, that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without a target network, even with off-policy data. Empirically, we find that online, parallelised sampling enabled by vectorised environments stabilises training without the need for a replay buffer. Motivated by these findings, we propose PQN, our simplified deep online Q-learning algorithm. Surprisingly, this simple algorithm is competitive with more complex methods such as Rainbow in Atari, R2D2 in Hanabi, QMix in Smax, and PPO-RNN in Craftax, and can be up to 50x faster than traditional DQN without sacrificing sample efficiency. In an era where PPO has become the go-to RL algorithm, PQN re-establishes Q-learning as a viable alternative.
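To make the core idea concrete, here is a minimal toy sketch (not the paper's actual PQN implementation): semi-gradient Q-learning on LayerNorm'd features, bootstrapping from the same network with no frozen target copy, and consuming each transition online with no replay buffer. The 2-state MDP, features, and hyperparameters below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Zero-mean / unit-variance normalisation (no learned affine, for brevity)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Hypothetical 2-state, 2-action toy MDP: taking action a moves to state a and
# yields reward 1 if a == 1 else 0. With gamma = 0.9 the true values are
# Q*(s, 1) = 10 and Q*(s, 0) = 9 in both states.
gamma, alpha, n_features = 0.9, 0.05, 8
phi = rng.normal(size=(2, n_features))   # fixed random state features
W = np.zeros((2, n_features))            # linear Q head: Q(s, a) = W[a] @ LN(phi[s])

def q(s):
    return W @ layer_norm(phi[s])

s = 0
for _ in range(5000):
    a = int(rng.integers(2))             # off-policy (uniform random) behaviour
    r, s_next = float(a == 1), a
    # TD target bootstraps from the SAME network -- no target network -- and
    # each transition is used once as it arrives -- no replay buffer.
    td_error = r + gamma * q(s_next).max() - q(s)[a]
    W[a] += alpha * td_error * layer_norm(phi[s])   # semi-gradient update
    s = s_next

print(np.round(q(0), 2), np.round(q(1), 2))
```

In this sketch the update converges to the true action values despite the off-policy behaviour and absence of both tricks; the normalisation keeps the effective features bounded, which is the intuition behind the paper's convergence result.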