Bimanual manipulation is essential in robotics, yet developing foundation models for it is extremely challenging due to the inherent complexity of coordinating two robot arms (which leads to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with an innovatively designed scalable Transformer that handles the heterogeneity of multi-modal inputs and captures the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which unifies the action representations of various robots while preserving the physical meaning of the original actions, facilitating the learning of transferable physical knowledge. With these designs, we pre-trained RDT on the largest collection of multi-robot datasets to date and scaled it to 1.2B parameters, making it the largest diffusion-based foundation model for robotic manipulation. We then fine-tuned RDT on a self-collected multi-task bimanual dataset of over 6K episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods: it exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills from as few as 1-5 demonstrations, and effectively handles complex, dexterous tasks. Code and videos are available at https://rdt-robotics.github.io/rdt-robotics/.
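To illustrate the idea behind a unified action space, the sketch below embeds each robot's native action fields into fixed, physically meaningful slots of a shared vector, with a mask marking which slots an embodiment actually uses. This is a minimal illustrative sketch only: the slot layout, dimension (`UNIFIED_DIM = 32`), and field names here are assumptions for exposition, not RDT's actual specification.

```python
import numpy as np

# Hypothetical slot layout (assumed for this sketch; RDT's real layout
# may differ). Each slot keeps its physical meaning, e.g. joint angles
# in radians, gripper width in meters.
UNIFIED_DIM = 32  # assumed size for illustration
SLOTS = {
    "right_arm_joint_pos": slice(0, 7),   # 7-DoF arm joint positions
    "right_gripper_width": slice(7, 8),
    "left_arm_joint_pos":  slice(8, 15),
    "left_gripper_width":  slice(15, 16),
    # remaining slots reserved for other embodiments (base, EEF pose, ...)
}

def to_unified(action_fields: dict) -> tuple[np.ndarray, np.ndarray]:
    """Embed a robot's native action dict into the unified vector.

    Returns the padded vector and a boolean mask of filled slots, so
    robots with fewer DoFs (e.g. single-arm) share the same space.
    """
    vec = np.zeros(UNIFIED_DIM)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in action_fields.items():
        s = SLOTS[name]
        vec[s] = values
        mask[s] = True
    return vec, mask

# A single-arm robot fills only the right-arm slots; a bimanual robot
# would fill both arms' slots. Unused slots stay zeroed and masked out.
vec, mask = to_unified({
    "right_arm_joint_pos": np.linspace(0.0, 0.6, 7),
    "right_gripper_width": [0.04],
})
```

Because every robot writes into the same physically grounded slots, a model trained across embodiments can, in principle, transfer knowledge such as "closing a gripper" without per-robot action decoders.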