In this paper, we present DiffusionVLA, a novel framework that seamlessly combines an autoregressive model with a diffusion model for learning a visuomotor policy. Central to our approach is a next-token prediction objective, which enables the model to reason effectively over the user's query in the context of current observations. A diffusion model is then attached to generate robust action outputs. To enhance policy learning through self-reasoning, we introduce a novel reasoning injection module that integrates reasoning phrases directly into the policy learning process. The whole framework is simple and flexible, making it easy to deploy and upgrade. We conduct extensive experiments on multiple real robots to validate the effectiveness of DiffusionVLA. Our tests include a challenging factory sorting task, in which DiffusionVLA successfully categorizes objects, including those not seen during training. We observe that the reasoning module makes the model interpretable: it allows observers to follow the model's thought process and identify potential causes of policy failures. Additionally, we test DiffusionVLA on a zero-shot bin-picking task, achieving 63.7\% accuracy on 102 previously unseen objects. Our method is robust to visual changes, such as distractors and new backgrounds, and adapts easily to new embodiments. Furthermore, DiffusionVLA can follow novel instructions and retains conversational ability. Notably, DiffusionVLA is data-efficient and fast at inference: our smallest model, DiffusionVLA-2B, runs at 82 Hz on a single A6000 GPU and can be trained from scratch on fewer than 50 demonstrations for a complex task. Finally, we scale the model from 2B to 72B parameters, showing improved generalization with increased model size.
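To make the described pipeline concrete, the following is a minimal toy sketch of the three-stage flow the abstract outlines: autoregressive reasoning over query and observation, injection of the pooled reasoning into a conditioning vector, and iterative diffusion-style denoising of an action. All function names, dimensions, the mean-pooling injection, and the linear denoiser are illustrative stand-ins, not the paper's actual architecture or learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_DIM = 7        # e.g. 6-DoF pose + gripper (illustrative choice)
EMBED_DIM = 16        # toy embedding size, not the paper's
N_DENOISE_STEPS = 10  # toy number of diffusion steps

def reason_autoregressively(query_emb, obs_emb, n_tokens=4):
    """Stub for next-token reasoning: emits a short sequence of
    'reasoning token' embeddings conditioned on query + observation.
    A real system would decode text tokens from a pretrained VLM."""
    ctx = query_emb + obs_emb
    tokens = []
    for _ in range(n_tokens):
        ctx = np.tanh(ctx + 0.1 * rng.standard_normal(EMBED_DIM))
        tokens.append(ctx)
    return np.stack(tokens)              # (n_tokens, EMBED_DIM)

def inject_reasoning(reasoning_tokens):
    """Hypothetical 'reasoning injection': pool reasoning tokens into a
    single conditioning vector. The paper's exact mechanism is not
    reproduced here; this is the simplest possible placeholder."""
    return reasoning_tokens.mean(axis=0)  # (EMBED_DIM,)

def denoise_step(action, cond):
    """Toy denoiser standing in for a learned diffusion network:
    contracts the noisy action toward a fake conditioned target."""
    w = 0.9
    target = np.tanh(cond[:ACTION_DIM])
    return w * action + (1 - w) * target

def generate_action(query_emb, obs_emb):
    """Full pipeline: reason, inject, then iteratively denoise an
    action starting from Gaussian noise."""
    reasoning = reason_autoregressively(query_emb, obs_emb)
    cond = inject_reasoning(reasoning)
    action = rng.standard_normal(ACTION_DIM)
    for _ in range(N_DENOISE_STEPS):
        action = denoise_step(action, cond)
    return action

action = generate_action(rng.standard_normal(EMBED_DIM),
                         rng.standard_normal(EMBED_DIM))
print(action.shape)  # (7,)
```

The key structural point the sketch mirrors is that the reasoning stage and the action head are separate modules joined only through a conditioning vector, which is what makes the framework easy to swap components in and out of.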