Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics. However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game." Consequently, LLM-based agents frequently struggle to comprehend their surroundings intuitively and to produce responses that are easy to understand. In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation. Steve-Eye integrates the LLM with a visual encoder, which enables it to process visual-text inputs and generate multimodal feedback. In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, equipping our model with three essential functions for an agent: multimodal perception, a foundational knowledge base, and skill prediction and planning. Lastly, we develop three open-world evaluation benchmarks and carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan. Code and datasets will be released.