With the rapid development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior work on embodied intelligence typically encodes scene or historical memory in a unimodal manner, either visual or linguistic, which complicates the alignment of the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), which translates high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module that integrates embodied control with large models through a visual-language memory of scenes. This memory enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities. We conduct experiments in a dynamic virtual cafe environment, using multiple large models in a zero-shot setting, with scenarios carefully designed to cover various situations. The experimental results show promising performance of MEIA across a range of embodied interactive tasks.