Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been highly successful at solving problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capacity of RL algorithms, scaling up to tasks that require remembering observations from $1500$ steps earlier. However, Transformers do not improve long-term credit assignment. In summary, our results provide an explanation for the success of Transformers in RL, while also highlighting an important area for future research and benchmark design.