In this work, we present a scalable reinforcement learning (RL) method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups; we therefore refer to the method as Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as a separate token, we can apply effective high-capacity sequence modeling techniques to Q-learning. We present several design decisions that enable good performance with offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large, diverse real-world robotic manipulation task suite. The project's website and videos can be found at https://q-transformer.github.io
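To make the per-dimension discretization concrete, the following is a minimal sketch, not the authors' implementation. It assumes uniform binning over a known action range, and it assumes that tokens are chosen one dimension at a time, each conditioned on the tokens already chosen; neither detail is specified in the abstract. The names `discretize`, `undiscretize`, `greedy_action_tokens`, and `toy_q_fn`, as well as the bin count, are hypothetical.

```python
import numpy as np

NUM_BINS = 256  # assumed bin count; the abstract does not state one


def discretize(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin index (one token per dimension)."""
    action = np.asarray(action, dtype=np.float64)
    frac = (action - low) / (high - low)
    idx = np.floor(frac * num_bins).astype(np.int64)
    # The upper bound of the range would land in bin num_bins, so clip it back.
    return np.clip(idx, 0, num_bins - 1)


def undiscretize(tokens, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the continuous value at the center of each bin."""
    tokens = np.asarray(tokens, dtype=np.float64)
    return low + (tokens + 0.5) * (high - low) / num_bins


def greedy_action_tokens(q_fn, state, num_dims):
    """Choose tokens dimension by dimension, each maximizing Q given the earlier tokens.

    q_fn(state, prev_tokens) is a hypothetical stand-in for a sequence model that
    returns one Q-value per bin for the next action dimension.
    """
    tokens = []
    for _ in range(num_dims):
        q_values = q_fn(state, tokens)
        tokens.append(int(np.argmax(q_values)))
    return tokens


def toy_q_fn(state, prev_tokens, num_bins=8):
    """Toy Q-function: prefers a bin that depends only on the dimension index."""
    target = (len(prev_tokens) * 3 + 1) % num_bins
    return -np.abs(np.arange(num_bins) - target).astype(np.float64)


if __name__ == "__main__":
    tokens = discretize([-1.0, 0.0, 1.0], low=-1.0, high=1.0, num_bins=4)
    print(tokens)
    print(undiscretize(tokens, low=-1.0, high=1.0, num_bins=4))
    print(greedy_action_tokens(toy_q_fn, state=None, num_dims=3))
```

Treating each dimension as its own token means the maximization over actions needed for a temporal difference target reduces to a sequence of argmax operations over a fixed number of bins, rather than a search over the joint action space.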