Self-supervised learning has brought about a paradigm shift in various computing domains, including NLP, vision, and biology. Recent approaches pre-train transformer models on vast amounts of unlabeled data and use them as a starting point for efficiently solving downstream tasks. In reinforcement learning, researchers have recently adapted these approaches by developing models pre-trained on expert trajectories, enabling them to address a wide range of tasks, from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper presents a comprehensive investigation of models we refer to as Pretrained Action-State Transformer Agents (PASTA). Our study uses a unified methodology and covers an extensive set of general downstream tasks, including behavioral cloning, offline RL, sensor failure robustness, and dynamics change adaptation. Our goal is to systematically compare design choices and provide practitioners with insights for building robust models. Key highlights of our study include tokenization at the level of individual action and state components, fundamental pre-training objectives such as next-token prediction, training models across diverse domains simultaneously, and parameter-efficient fine-tuning (PEFT). The models developed in our study contain fewer than 10 million parameters, and PEFT allows fine-tuning fewer than 10,000 parameters during downstream adaptation, so that a broad community can use these models and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first-principles design choices to represent RL trajectories and contribute to robust policy learning.
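To make the idea of component-level tokenization with a next-token objective concrete, the following is a minimal sketch and not the implementation used in the study. It assumes each scalar state or action component becomes one token and that each timestep's state components precede its action components; the function names `tokenize_components` and `next_token_pairs` are hypothetical, and the abstract does not specify the ordering or any embedding step.

```python
from typing import List, Sequence, Tuple


def tokenize_components(
    states: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
) -> List[float]:
    """Flatten a trajectory into one scalar token per state/action component.

    Assumed ordering (illustrative only): for each timestep t, all components
    of the state s_t, followed by all components of the action a_t.
    """
    if len(states) != len(actions):
        raise ValueError("states and actions must cover the same timesteps")
    tokens: List[float] = []
    for state, action in zip(states, actions):
        tokens.extend(float(x) for x in state)   # one token per state component
        tokens.extend(float(x) for x in action)  # one token per action component
    return tokens


def next_token_pairs(tokens: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Shift by one position: the model reads tokens[:-1] and predicts tokens[1:]."""
    return list(tokens[:-1]), list(tokens[1:])


# Toy trajectory: 2 timesteps, 3 state components, 2 action components.
states = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
actions = [[1.0, -1.0], [0.5, -0.5]]

tokens = tokenize_components(states, actions)
inputs, targets = next_token_pairs(tokens)
```

Under these assumptions, a trajectory of T timesteps yields T × (state_dim + action_dim) tokens, so sequence length grows with the dimensionality of the environment as well as with the horizon.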