Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that let them imagine future scenarios and plan actions accordingly. To this end, we propose 3D-VLA, a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them with the LLM to predict goal images and point clouds. To train 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.