Deep reinforcement learning repeatedly succeeds in closed, well-defined domains such as games (chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as the highly successful MuZero, aim to accomplish this by learning a world model. However, leveraging a world model has not consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally generated MiniPacman and on Chaser from the ProcGen suite, training on a set of mazes and then testing on unseen rotated versions, which demonstrates the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.
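To make the equivariance property concrete, the sketch below checks it numerically for the group of 90-degree grid rotations. It is an illustration only and not the architecture used in this work: it obtains an equivariant policy by group averaging (symmetrising) an arbitrary function, a generic construction swapped in here for brevity. The action ordering (up, left, down, right, no-op), the 5x5 observation size, and all function names are assumptions made for the example.

```python
import numpy as np

N_DIRS = 4  # assumed action order: up, left, down, right, followed by a no-op


def act_on_obs(obs, k):
    """Rotate a square grid observation by k * 90 degrees counter-clockwise."""
    return np.rot90(obs, k)


def act_on_policy(logits, k):
    """Cyclically shift the directional logits by k; the no-op logit is fixed."""
    return np.concatenate([np.roll(logits[:N_DIRS], k), logits[N_DIRS:]])


def symmetrise(f):
    """Group-average f over the four rotations so the result is equivariant."""
    def f_eq(obs):
        # Evaluate f on every rotated copy, undo the rotation on the output,
        # then average over the group.
        outs = [act_on_policy(f(act_on_obs(obs, k)), -k) for k in range(4)]
        return np.mean(outs, axis=0)
    return f_eq


rng = np.random.default_rng(0)
W = rng.normal(size=(N_DIRS + 1, 25))  # hypothetical, unconstrained linear policy


def plain_policy(obs):
    """A stand-in policy with no built-in symmetry."""
    return W @ obs.reshape(-1)


equivariant_policy = symmetrise(plain_policy)

obs = rng.normal(size=(5, 5))
lhs = equivariant_policy(act_on_obs(obs, 1))     # rotate, then evaluate
rhs = act_on_policy(equivariant_policy(obs), 1)  # evaluate, then rotate
print("symmetrised policy equivariant:", np.allclose(lhs, rhs))
```

The check compares "rotate the observation, then evaluate" against "evaluate, then permute the actions"; equivariance means the two agree for every group element. Group averaging costs four forward passes per evaluation, which is one reason architectures that are equivariant by construction are usually preferred in practice.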