Sequential decision-making algorithms such as reinforcement learning (RL) in real-world scenarios inevitably face environments with partial observability. This paper scrutinizes the effectiveness of a popular architecture, namely Transformers, in Partially Observable Markov Decision Processes (POMDPs) and reveals its theoretical limitations. We establish that regular languages, which Transformers struggle to model, are reducible to POMDPs. This poses a significant challenge for Transformers in learning POMDP-specific inductive biases, because they lack the inherent recurrence found in other models such as RNNs. This paper casts doubt on the prevalent belief in Transformers as sequence models for RL and proposes introducing a point-wise recurrent structure instead. The Deep Linear Recurrent Unit (LRU) emerges as a well-suited alternative for Partially Observable RL, with empirical results highlighting the sub-optimal performance of the Transformer and the considerable strength of the LRU.
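As a concrete illustration of the point-wise recurrent structure the abstract refers to (a sketch, not the paper's implementation), the core of an LRU-style layer is a diagonal linear recurrence: each hidden channel is updated independently by an element-wise complex decay, so the state update is point-wise rather than attention-based. The names `lru_scan`, `lam`, `B`, and `C` below are hypothetical:

```python
import numpy as np

def lru_scan(x, lam, B, C):
    """Minimal diagonal linear recurrence, sketching an LRU-style layer.

    h_t = lam * h_{t-1} + B @ x_t   (element-wise decay; lam complex, |lam| < 1)
    y_t = Re(C @ h_t)               (real-valued readout)
    """
    d_h = lam.shape[0]
    h = np.zeros(d_h, dtype=complex)
    ys = []
    for x_t in x:
        # Point-wise recurrence: each channel of h decays independently.
        h = lam * h + B @ x_t
        ys.append((C @ h).real)
    return np.stack(ys)

# Hypothetical usage on a random observation sequence of length 6.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))                      # (time, input dim)
lam = 0.9 * np.exp(1j * np.linspace(0.0, 1.0, 4))  # stable complex decays
B = rng.normal(size=(4, 3)).astype(complex)      # input projection
C = rng.normal(size=(2, 4)).astype(complex)      # output projection
y = lru_scan(x, lam, B, C)                       # (6, 2) real outputs
```

Because the recurrence is linear and diagonal, the loop above can in practice be replaced by a parallel prefix scan, which is what makes such layers competitive with Transformers in training speed while retaining the recurrent inductive bias.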