Multimodal Large Language Models (MLLMs), which build upon Large Language Models (LLMs) with exceptional reasoning and generalization capabilities, have opened up new avenues for embodied task planning. MLLMs excel at integrating diverse environmental inputs, such as real-time task progress, visual observations, and open-form language instructions, which are crucial for executable task planning. In this work, we introduce EgoPlan-Bench, a human-annotated benchmark, to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios. Our benchmark is distinguished by realistic tasks derived from real-world videos, a diverse set of actions involving interactions with hundreds of different objects, and complex visual observations from varied environments. We evaluate various open-source MLLMs as well as GPT-4V, and find that none of these models has yet evolved into an embodied planning generalist. We further construct an instruction-tuning dataset, EgoPlan-IT, from videos of human-object interactions to facilitate the learning of high-level task planning in intricate real-world situations. Experimental results demonstrate that a model tuned on EgoPlan-IT not only improves significantly on our benchmark but also acts effectively as an embodied planner in simulations.