Self-supervised learning has brought about a paradigm shift in various computing domains, including NLP, vision, and biology. Recent approaches pre-train transformer models on vast amounts of unlabeled data and use them as a starting point for efficiently solving downstream tasks. In reinforcement learning, researchers have recently adapted these approaches by developing models pre-trained on expert trajectories, which enables the models to tackle a broad spectrum of tasks, ranging from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper presents a comprehensive investigation of such models, which we refer to as pre-trained action-state transformer agents (PASTA). Our study uses a unified methodology and covers an extensive set of general downstream tasks, including behavioral cloning, offline RL, sensor failure robustness, and dynamics change adaptation. Our objective is to systematically compare design choices and offer insights that will help practitioners develop robust models. Key highlights of our study include tokenization at the component level for actions and states, the use of fundamental pre-training objectives such as next-token prediction or masked language modeling, simultaneous training of models across multiple domains, and the application of various fine-tuning strategies. The models developed in this study contain fewer than 7 million parameters, which allows a broad community to use them and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first-principles design choices to represent RL trajectories and contribute to robust policy learning.
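To illustrate two of the ideas named above, component-level tokenization and the next-token and masked prediction objectives, the following Python sketch shows one plausible reading of them. It is not the authors' implementation: the function names, the order in which state and action components are interleaved, the zero mask value, and the toy trajectory are all assumptions made for illustration.

```python
import numpy as np


def component_tokens(states, actions):
    """Flatten a trajectory into one scalar token per state/action component.

    Assumed layout (hypothetical): for each timestep t, all components of
    the state s_t come first, followed by all components of the action a_t.
    """
    states = np.asarray(states, dtype=np.float32)
    actions = np.asarray(actions, dtype=np.float32)
    if states.shape[0] != actions.shape[0]:
        raise ValueError("states and actions must share the time dimension")
    # Row t is [s_t[0], ..., s_t[ds-1], a_t[0], ..., a_t[da-1]];
    # reshape(-1) then concatenates the rows in time order.
    return np.concatenate([states, actions], axis=1).reshape(-1)


def next_token_pairs(tokens):
    """Inputs and targets for next-token prediction: predict token i+1 from tokens <= i."""
    return tokens[:-1], tokens[1:]


def random_mask(tokens, mask_ratio, rng, mask_value=0.0):
    """Masked-prediction setup: hide a random subset of tokens.

    Returns the corrupted sequence and the boolean mask marking the
    positions a model would be trained to reconstruct.
    """
    mask = rng.random(tokens.shape[0]) < mask_ratio
    corrupted = tokens.copy()
    corrupted[mask] = mask_value
    return corrupted, mask


# Toy trajectory (made up): 2 timesteps, 3 state components, 2 action components.
states = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
actions = np.array([[10.0, 11.0], [12.0, 13.0]])

tokens = component_tokens(states, actions)
inputs, targets = next_token_pairs(tokens)
corrupted, mask = random_mask(tokens, mask_ratio=0.5, rng=np.random.default_rng(0))
```

With this layout, a trajectory of T timesteps yields T × (state dimension + action dimension) tokens, so each sensor reading or actuator command can be masked or predicted on its own rather than as part of a whole state or action vector.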