Transformers have demonstrated strong potential in offline reinforcement learning (RL) by modeling trajectories as sequences of returns-to-go, states, and actions. However, existing approaches such as the Decision Transformer (DT) and its variants suffer from redundant tokenization and quadratic attention complexity, which limit their scalability in real-time or resource-constrained settings. To address this, we propose a Unified Token Representation (UTR) that merges the return-to-go, state, and action of each timestep into a single token, substantially reducing sequence length and model complexity. Theoretical analysis shows that UTR yields a tighter Rademacher complexity bound, suggesting improved generalization. We further develop two variants, UDT and UDC, built on transformer and gated-CNN backbones, respectively; both match or surpass state-of-the-art methods at markedly lower computational cost. These findings indicate that UTR generalizes well across architectures and may provide an efficient foundation for scalable control in future large decision models.
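To make the tokenization contrast concrete, the following is a minimal PyTorch-style sketch of the unified-token idea, not the reference implementation: the return-to-go, state, and action of each timestep are concatenated and projected into one embedding, so a context of K timesteps yields K tokens instead of the 3K tokens used by DT. All class names and dimensions below are hypothetical.

```python
# Illustrative sketch only (not the authors' code): one token per timestep
# instead of the Decision Transformer's three tokens (return-to-go, state, action).
import torch
import torch.nn as nn

class UnifiedTokenizer(nn.Module):
    """Hypothetical UTR-style embedding: one token per timestep."""
    def __init__(self, state_dim: int, act_dim: int, embed_dim: int):
        super().__init__()
        # Joint projection of [return-to-go, state, action] -> single embedding.
        self.proj = nn.Linear(1 + state_dim + act_dim, embed_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim)
        x = torch.cat([rtg, states, actions], dim=-1)  # (B, K, 1 + S + A)
        return self.proj(x)                            # (B, K, embed_dim)

if __name__ == "__main__":
    B, K, S, A, D = 4, 20, 17, 6, 128
    tok = UnifiedTokenizer(S, A, D)
    tokens = tok(torch.randn(B, K, 1), torch.randn(B, K, S), torch.randn(B, K, A))
    print(tokens.shape)  # torch.Size([4, 20, 128]) -- K tokens, not 3K
```

The resulting K-token sequence can then be fed to either a transformer or a gated-CNN backbone, which is how the UDT and UDC variants differ.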