Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, two significant challenges remain: (i) most of these models rely on 2D images and exhibit limited capacity for 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning, and acting. We argue that these limitations significantly hinder current models from performing real-world tasks and approaching general intelligence. To this end, we introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Moreover, we meticulously design an LLM-assisted pipeline to produce high-quality 3D VL data. Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation, and manipulation. Our ablation studies and scaling analyses further provide valuable insights for developing future embodied generalist agents. Code and data are available on the project page.