Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work, we present two main algorithms: (i) an off-policy training procedure that works with trajectories while maintaining the training efficiency of the S4 model, and (ii) an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism. Our results indicate that our method outperforms multiple variants of Decision Transformers, as well as the other baseline methods, on most tasks, while reducing latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.