Diffusion Transformer Policy: Scaling Diffusion Transformer for Generalist Visual-Language-Action Learning

Recent large visual-language-action models pretrained on diverse robot datasets have shown the potential to generalize to new environments with only a small amount of in-domain data. However, these approaches usually predict a single discretized or continuous action with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model continuous action sequences with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, which denoises action chunks directly with a large transformer model rather than relying on a small action head for action embedding. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large and diverse robot datasets and achieves better generalization. Extensive experiments demonstrate the effectiveness and generalization of Diffusion Transformer Policy on Maniskill2, Libero, Calvin, and SimplerEnv, as well as on a real-world Franka arm; it consistently outperforms OpenVLA and Octo on the Real-to-Sim benchmark SimplerEnv, the real-world Franka arm, and Libero. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance on the Calvin ABC->D task using only a single third-view camera stream, raising the average number of tasks completed in a row (out of 5) to 3.6, and the pretraining stage increases the success sequence length on Calvin by over 1.2. Project Page: https://zhihou7.github.io/dit_policy_vla/
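
To make the idea of "denoising an action chunk" concrete, the following is a minimal sketch of DDPM-style reverse sampling over a chunk of continuous actions. It is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the chunk length of 8, the 7-dimensional action, and the names `make_schedule`, `oracle_denoiser`, and `sample_action_chunk` are all hypothetical. In particular, `oracle_denoiser` is a toy stand-in for the large transformer; it is handed the clean chunk as its "context", whereas a real policy would condition on image and language tokens and would have to learn the noise prediction.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def oracle_denoiser(noisy_chunk, t, context, alpha_bars):
    """Hypothetical stand-in for the transformer denoiser.

    Here `context` is the clean action chunk itself, so the noise can be
    recovered in closed form. A learned model would instead predict it from
    observations and the language instruction.
    """
    ab = alpha_bars[t]
    return (noisy_chunk - np.sqrt(ab) * context) / np.sqrt(1.0 - ab)


def sample_action_chunk(denoiser, context, shape, num_steps=50, seed=0):
    """Run the DDPM reverse process from Gaussian noise to an action chunk."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(x, t, context, alpha_bars)
        # Posterior mean of the previous, less noisy chunk.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


# Assumed shapes: 8 future steps, 7-dimensional end-effector action.
target = np.linspace(-1.0, 1.0, 8 * 7).reshape(8, 7)
chunk = sample_action_chunk(oracle_denoiser, target, target.shape)
print(np.allclose(chunk, target, atol=1e-6))
```

Because the whole chunk is denoised jointly, every step of the sequence is refined together at each iteration, which is the contrast the abstract draws with predicting one action at a time.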
