In this paper, we present DiffusionVLA, a novel framework that seamlessly combines an autoregressive model with a diffusion model for learning a visuomotor policy. Central to our approach is a next-token prediction objective, which enables the model to reason effectively over the user's query in the context of current observations. A diffusion model is then attached to generate robust action outputs. To enhance policy learning through self-reasoning, we introduce a novel reasoning injection module that integrates reasoning phrases directly into the policy learning process. The whole framework is simple and flexible, making it easy to deploy and upgrade. We conduct extensive experiments on multiple real robots to validate the effectiveness of DiffusionVLA. Our tests include a challenging factory sorting task, in which DiffusionVLA successfully categorizes objects, including those not seen during training. We observe that the reasoning module makes the model interpretable: it allows observers to follow the model's thought process and identify potential causes of policy failures. Additionally, we test DiffusionVLA on a zero-shot bin-picking task, achieving 63.7\% accuracy on 102 previously unseen objects. Our method is robust to visual changes, such as distractors and new backgrounds, and adapts easily to new embodiments. Furthermore, DiffusionVLA can follow novel instructions and retains conversational ability. Notably, DiffusionVLA is data-efficient and fast at inference: our smallest model, DiffusionVLA-2B, runs at 82 Hz on a single A6000 GPU and can be trained from scratch on fewer than 50 demonstrations for a complex task. Finally, we scale the model from 2B to 72B parameters, showing improved generalization with increased model size.
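To make the described pipeline concrete, the following is a minimal toy sketch of the three-stage flow the abstract outlines: autoregressive reasoning over query and observation, injection of the pooled reasoning into a conditioning vector, and iterative diffusion-style denoising of an action. All function names, dimensions, the mean-pooling injection, and the linear denoiser are illustrative stand-ins, not the paper's actual architecture or learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_DIM = 7        # e.g. 6-DoF pose + gripper (illustrative choice)
EMBED_DIM = 16        # toy embedding size, not the paper's
N_DENOISE_STEPS = 10  # toy number of diffusion steps

def reason_autoregressively(query_emb, obs_emb, n_tokens=4):
    """Stub for next-token reasoning: emits a short sequence of
    'reasoning token' embeddings conditioned on query + observation.
    A real system would decode text tokens from a pretrained VLM."""
    ctx = query_emb + obs_emb
    tokens = []
    for _ in range(n_tokens):
        ctx = np.tanh(ctx + 0.1 * rng.standard_normal(EMBED_DIM))
        tokens.append(ctx)
    return np.stack(tokens)              # (n_tokens, EMBED_DIM)

def inject_reasoning(reasoning_tokens):
    """Hypothetical 'reasoning injection': pool reasoning tokens into a
    single conditioning vector. The paper's exact mechanism is not
    reproduced here; this is the simplest possible placeholder."""
    return reasoning_tokens.mean(axis=0)  # (EMBED_DIM,)

def denoise_step(action, cond):
    """Toy denoiser standing in for a learned diffusion network:
    contracts the noisy action toward a fake conditioned target."""
    w = 0.9
    target = np.tanh(cond[:ACTION_DIM])
    return w * action + (1 - w) * target

def generate_action(query_emb, obs_emb):
    """Full pipeline: reason, inject, then iteratively denoise an
    action starting from Gaussian noise."""
    reasoning = reason_autoregressively(query_emb, obs_emb)
    cond = inject_reasoning(reasoning)
    action = rng.standard_normal(ACTION_DIM)
    for _ in range(N_DENOISE_STEPS):
        action = denoise_step(action, cond)
    return action

action = generate_action(rng.standard_normal(EMBED_DIM),
                         rng.standard_normal(EMBED_DIM))
print(action.shape)  # (7,)
```

The key structural point the sketch mirrors is that the reasoning stage and the action head are separate modules joined only through a conditioning vector, which is what makes the framework easy to swap components in and out of.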