Bimanual manipulation is essential in robotics, yet developing foundation models for it is extremely challenging due to the inherent complexity of coordinating two robot arms (which leads to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with an innovatively designed scalable Transformer that handles the heterogeneity of multi-modal inputs and captures the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which unifies the action representations of various robots while preserving the physical meaning of the original actions, facilitating the learning of transferable physical knowledge. With these designs, we pre-trained RDT on the largest collection of multi-robot datasets to date and scaled it to 1.2B parameters, making it the largest diffusion-based foundation model for robotic manipulation. We then fine-tuned RDT on a self-collected multi-task bimanual dataset of over 6K episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods: it exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills from as few as 1-5 demonstrations, and effectively handles complex, dexterous tasks. Code and videos are available at https://rdt-robotics.github.io/rdt-robotics/.
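To illustrate the idea behind a unified action space, the sketch below embeds each robot's native action fields into fixed, physically meaningful slots of a shared vector, with a mask marking which slots an embodiment actually uses. This is a minimal illustrative sketch only: the slot layout, dimension (`UNIFIED_DIM = 32`), and field names here are assumptions for exposition, not RDT's actual specification.

```python
import numpy as np

# Hypothetical slot layout (assumed for this sketch; RDT's real layout
# may differ). Each slot keeps its physical meaning, e.g. joint angles
# in radians, gripper width in meters.
UNIFIED_DIM = 32  # assumed size for illustration
SLOTS = {
    "right_arm_joint_pos": slice(0, 7),   # 7-DoF arm joint positions
    "right_gripper_width": slice(7, 8),
    "left_arm_joint_pos":  slice(8, 15),
    "left_gripper_width":  slice(15, 16),
    # remaining slots reserved for other embodiments (base, EEF pose, ...)
}

def to_unified(action_fields: dict) -> tuple[np.ndarray, np.ndarray]:
    """Embed a robot's native action dict into the unified vector.

    Returns the padded vector and a boolean mask of filled slots, so
    robots with fewer DoFs (e.g. single-arm) share the same space.
    """
    vec = np.zeros(UNIFIED_DIM)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in action_fields.items():
        s = SLOTS[name]
        vec[s] = values
        mask[s] = True
    return vec, mask

# A single-arm robot fills only the right-arm slots; a bimanual robot
# would fill both arms' slots. Unused slots stay zeroed and masked out.
vec, mask = to_unified({
    "right_arm_joint_pos": np.linspace(0.0, 0.6, 7),
    "right_gripper_width": [0.04],
})
```

Because every robot writes into the same physically grounded slots, a model trained across embodiments can, in principle, transfer knowledge such as "closing a gripper" without per-robot action decoders.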