Leveraging the massive knowledge and learning schemes of large language models (LLMs), recent machine learning models have shown notable success in building generalist agents capable of general-purpose task solving across diverse domains, including natural language processing, computer vision, and robotics. However, a significant challenge remains: these models exhibit limited ability to understand and interact with the 3D world. We argue that this limitation significantly hinders current models from performing real-world tasks and, further, from achieving general intelligence. To this end, we introduce an embodied multi-modal and multi-task generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. Our proposed agent, referred to as LEO, is trained with shared LLM-based model architectures, objectives, and weights in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. To facilitate the training, we meticulously curate and generate an extensive dataset comprising object-level and scene-level multi-modal tasks of large scale and complexity, which demand a deep understanding of and interaction with the 3D world. Through rigorous experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, embodied navigation, and robotic manipulation. Our ablation results further provide valuable insights for the development of future embodied generalist agents.