The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, strong generalization, and proficiency in processing multimodal inputs. A crucial milestone on the path to AGI is the attainment of human-level planning, a fundamental ability for making informed decisions in complex environments and solving a wide range of real-world problems. Despite the impressive advances in MLLMs, a question remains: how far are current MLLMs from achieving human-level planning? To shed light on this question, we introduce EgoPlan-Bench, a comprehensive benchmark for evaluating the planning abilities of MLLMs in real-world scenarios from an egocentric perspective, mirroring human perception. EgoPlan-Bench features realistic tasks, diverse action plans, and intricate visual observations. Our rigorous evaluation of a wide range of MLLMs reveals that EgoPlan-Bench poses significant challenges, highlighting substantial room for improvement before MLLMs achieve human-level task planning. To facilitate this advancement, we further present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench. We have made all code, data, and a maintained benchmark leaderboard available to advance future research.