With the surge in the development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior works on embodied intelligence typically encode scene or historical memory in a unimodal manner, either visual or linguistic, which complicates the alignment of the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), which translates high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module that facilitates the integration of embodied control with large models through a visual-language memory of scenes. This capability enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities. Furthermore, we construct an embodied question answering dataset based on a dynamic virtual cafe environment with the help of a large language model. In this virtual environment, we conduct several experiments using multiple large models in a zero-shot setting, with scenarios carefully designed for various situations. The experimental results show promising performance of MEIA across various embodied interactive tasks.