EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models

Multimodal Large Language Models (MLLMs), which build on Large Language Models (LLMs) with exceptional reasoning and generalization capabilities, have opened up new avenues for embodied task planning. MLLMs excel at integrating diverse environmental inputs, such as real-time task progress, visual observations, and open-form language instructions, all of which are crucial for executable task planning. In this work, we introduce EgoPlan-Bench, a human-annotated benchmark, to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios. Our benchmark is distinguished by realistic tasks derived from real-world videos, a diverse set of actions involving interactions with hundreds of different objects, and complex visual observations from varied environments. We evaluate various open-source MLLMs, revealing that these models, and even GPT-4V, have not yet evolved into embodied planning generalists. We further construct an instruction-tuning dataset, EgoPlan-IT, from videos of human-object interactions to facilitate the learning of high-level task planning in intricate real-world situations. The experimental results demonstrate that the model tuned on EgoPlan-IT not only performs significantly better on our benchmark, but also acts effectively as an embodied planner in simulations.


