Recent large vision-language-action models pretrained on diverse robot datasets have shown the potential to generalize to new environments with only a small amount of in-domain data. However, these approaches usually predict discretized or continuous actions with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model continuous actions with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, which directly denoises action chunks with a large transformer model rather than a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets and achieve better generalization performance. Extensive experiments demonstrate that Diffusion Transformer Policy pretrained on diverse robot data can generalize to different embodiments, including simulation environments such as Maniskill2 and Calvin, as well as a real-world Franka arm. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance with only a single third-view camera stream in the Calvin novel task setting (ABC->D), raising the average number of tasks completed in a row (out of 5) to 3.6, and the pretraining stage increases the average success sequence length on Calvin by over 1.2. The code will be publicly available.
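To make the idea of denoising an action chunk concrete, the following is a minimal sketch and not the authors' implementation. It assumes a standard DDPM-style reverse process with a linear noise schedule; the abstract does not specify the sampler, the schedule, the chunk length, or the action dimension, so `HORIZON`, `ACTION_DIM`, `NUM_STEPS`, and the function names are all hypothetical. The large transformer is replaced by a stand-in "oracle" noise predictor that already knows the target chunk, so the loop can be run without any trained model.

```python
import math
import random

# Hypothetical sizes chosen for illustration only (not from the paper).
HORIZON = 4      # number of actions in one chunk
ACTION_DIM = 7   # assumed end-effector action size (e.g. pose delta + gripper)
NUM_STEPS = 50   # number of diffusion steps

# Linear beta schedule and the cumulative products used by DDPM.
betas = [1e-4 + (0.02 - 1e-4) * i / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
_running = 1.0
for _a in alphas:
    _running *= _a
    alpha_bars.append(_running)


def make_oracle_denoiser(target_chunk):
    """Stand-in for the transformer: returns the exact noise for a known target.

    In the real policy this role would be played by a learned network
    conditioned on image and language tokens (an assumption of this sketch).
    """
    def predict_noise(noisy_chunk, t):
        scale = math.sqrt(alpha_bars[t])
        sigma = math.sqrt(1.0 - alpha_bars[t])
        return [[(x - scale * x0) / sigma for x, x0 in zip(row, row0)]
                for row, row0 in zip(noisy_chunk, target_chunk)]
    return predict_noise


def denoise_action_chunk(predict_noise, rng):
    """Run the DDPM reverse process from Gaussian noise to an action chunk."""
    chunk = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
             for _ in range(HORIZON)]
    for t in reversed(range(NUM_STEPS)):
        eps = predict_noise(chunk, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv = 1.0 / math.sqrt(alphas[t])
        # No fresh noise is added on the final step (t == 0).
        noise_scale = math.sqrt(betas[t]) if t > 0 else 0.0
        chunk = [[inv * (x - coef * e) + noise_scale * rng.gauss(0.0, 1.0)
                  for x, e in zip(row, eps_row)]
                 for row, eps_row in zip(chunk, eps)]
    return chunk
```

Calling `denoise_action_chunk(make_oracle_denoiser(target), random.Random(0))` with any `HORIZON` x `ACTION_DIM` nested list as `target` returns a chunk matching `target` up to floating-point error, because the oracle supplies the exact noise. With a learned predictor the same loop would instead produce a sample from the model's action distribution.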