Recently, the use of transformers in offline reinforcement learning has become a rapidly developing research area. This is due to their ability to treat an agent's trajectory in the environment as a sequence, thereby reducing the policy learning problem to sequence modeling. In environments where an agent's decisions depend on past events (POMDPs), it is essential to capture both the event itself and the subsequent decision point within the model's context. However, the quadratic complexity of the attention mechanism limits the extent to which the context can be expanded. One solution to this problem is to augment transformers with memory mechanisms. This paper proposes the Recurrent Action Transformer with Memory (RATE), a novel model architecture that incorporates a recurrent memory mechanism designed to regulate information retention. To evaluate our model, we conducted extensive experiments on memory-intensive environments (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid.Memory), classic Atari games, and MuJoCo control environments. The results show that using memory can significantly improve performance in memory-intensive environments while maintaining or improving results in classic environments. We hope our findings will stimulate research on memory mechanisms for transformers applicable to offline reinforcement learning.
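To make the core idea concrete, below is a minimal sketch in PyTorch of how a recurrent memory mechanism can extend a transformer's effective context beyond a single attention window: a small set of learned memory tokens is prepended to each trajectory segment, processed jointly with it, and the updated memory embeddings are carried over to the next segment. This is an illustration of the general recurrent-memory pattern, not the authors' implementation; all class, parameter, and size names here are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentMemoryTransformer(nn.Module):
    """Illustrative recurrent-memory transformer (hypothetical names/sizes)."""

    def __init__(self, d_model=128, n_mem=8, n_layers=4, n_heads=4):
        super().__init__()
        # Learned initial memory tokens, shared across trajectories.
        self.init_memory = nn.Parameter(torch.randn(n_mem, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_mem = n_mem

    def forward(self, segment_emb, memory=None):
        # segment_emb: (batch, seg_len, d_model) token embeddings of one
        # trajectory segment (e.g., interleaved returns/states/actions).
        b = segment_emb.size(0)
        if memory is None:
            memory = self.init_memory.unsqueeze(0).expand(b, -1, -1)
        # Prepend memory tokens so attention can read from and write to them.
        x = torch.cat([memory, segment_emb], dim=1)
        h = self.encoder(x)
        new_memory = h[:, :self.n_mem]   # carried over to the next segment
        segment_out = h[:, self.n_mem:]  # used e.g. to predict actions
        return segment_out, new_memory   # optionally .detach() to truncate BPTT

# Usage: process a long trajectory as consecutive segments,
# threading the memory state between them.
model = RecurrentMemoryTransformer()
memory = None
for segment in torch.randn(5, 2, 20, 128):  # 5 segments, batch of 2
    out, memory = model(segment, memory)
```

Whether gradients flow through the memory across segments (full backpropagation through time versus detaching between segments) is a design choice that trades training cost against how well long-range dependencies are learned; the sketch leaves the memory attached by default.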