Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.