Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. When seamlessly integrated into an embodied agent, such models mark a crucial stride towards autonomous, context-aware systems capable of formulating plans and executing commands with precision. In this paper, we introduce Octopus, a novel VLM designed to decipher an agent's visual input and textual task objectives, formulate intricate action sequences, and generate executable code. Our design allows the agent to handle a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games. Octopus is trained on data generated by using GPT-4 to control an explorative agent within our experimental environment, OctoVerse; the data consist of action blueprints and the corresponding executable code. We also collect environmental feedback, which enables an enhanced training scheme, Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we illuminate Octopus's functionality and present compelling results, and the proposed RLEF is shown to refine the agent's decision-making. By open-sourcing our model architecture, simulator, and dataset, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community.