3D-VLA: A 3D Vision-Language-Action Generative World Model

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

