We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). We show that, despite being trained only on language, these Transformers excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art imitation learning methods (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains, learning general patterns in demonstration data for highly efficient imitation learning and indicating promising new avenues for repurposing natural-language models for embodied tasks. Videos are available at https://www.robot-learning.uk/keypoint-action-tokens.
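To make the in-context mechanism concrete, the sketch below shows one plausible way to serialise keypoint observations and action trajectories into text and query a chat-style model few-shot. It is a minimal illustration, not the paper's exact tokenisation scheme: the helper names (`serialise`, `build_prompt`, `predict_actions`), the quantisation scale, and the prompt wording are assumptions, and it presumes the official OpenAI Python client with a `gpt-4-turbo` model and an API key in the environment.

```python
# Hypothetical sketch of KAT-style few-shot prompting (helper names and
# quantisation scheme are assumptions, not the paper's exact method).
from openai import OpenAI  # assumes the official OpenAI Python client


def serialise(vectors, scale=100):
    """Flatten float vectors into a compact, token-friendly integer string,
    e.g. [[0.12, 0.34]] -> '12 34'. Quantisation keeps sequences short."""
    return " ".join(str(int(round(v * scale))) for vec in vectors for v in vec)


def build_prompt(demos, test_keypoints):
    """demos: list of (keypoints, actions) pairs from the demonstrator.
    Builds a few-shot prompt mapping keypoint strings to action strings."""
    lines = ["Given input keypoints, output the action trajectory.\n"]
    for keypoints, actions in demos:
        lines.append(f"Input: {serialise(keypoints)}")
        lines.append(f"Output: {serialise(actions)}\n")
    lines.append(f"Input: {serialise(test_keypoints)}")
    lines.append("Output:")
    return "\n".join(lines)


def predict_actions(demos, test_keypoints, scale=100):
    """Queries the language model once and de-quantises its completion
    back into a numeric action trajectory."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model identifier
        messages=[{"role": "user",
                   "content": build_prompt(demos, test_keypoints)}],
        temperature=0.0,  # deterministic pattern completion
    )
    tokens = resp.choices[0].message.content.split()
    return [int(t) / scale for t in tokens if t.lstrip("-").isdigit()]
```

Under these assumptions, the model never sees pixels or raw floats: both observations and actions arrive as short integer sequences, so imitation reduces to the sequence-completion behaviour the Transformer already acquired during language pretraining.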