Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as input a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to its flexible design, GR-1 can be seamlessly fine-tuned on robot data after being pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and on a real robot. On the CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. In the setting of zero-shot unseen scene generalization, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline methods and shows strong potential for generalizing to unseen scenes and objects. We provide initial evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Project page: https://GR1-Manipulation.github.io
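The input and output interface described in the abstract can be pictured with a small sketch. The abstract states only that the model is GPT-style, that it takes a language instruction, observation images, and robot states, and that it predicts actions and future images. Everything else here is an assumption made for illustration: the names Step, build_sequence, and causal_mask, the exact ordering of tokens, the use of explicit prediction slots, and the sample instruction "open the drawer". Strings stand in for real embeddings, so this shows one plausible sequence layout rather than the model itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One timestep of input: an observation image and a robot state."""
    image: str  # hypothetical stand-in for an encoded image
    state: str  # hypothetical stand-in for a robot state vector


def build_sequence(instruction: str, steps: List[Step]) -> List[str]:
    """Lay out one token sequence for a GPT-style policy.

    Assumed layout (illustrative, not taken from the abstract): the
    language instruction comes first, then each timestep contributes its
    state, its image, and two prediction slots, one for the action and
    one for a future image.
    """
    tokens = [f"lang:{instruction}"]
    for t, step in enumerate(steps):
        tokens += [
            f"state[{t}]:{step.state}",
            f"image[{t}]:{step.image}",
            f"predict_action[{t}]",
            f"predict_future_image[{t}]",
        ]
    return tokens


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Two timesteps of placeholder observations and states.
steps = [Step(image="img0", state="s0"), Step(image="img1", state="s1")]
sequence = build_sequence("open the drawer", steps)
mask = causal_mask(len(sequence))
```

Under this assumed layout, the same sequence format could serve both training stages: video data would supervise only the future-image slots, while robot data would supervise the action slots as well.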