EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

Multimodal Large Language Models (MLLMs), which build on the power of Large Language Models (LLMs), have recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence (AGI). However, achieving AGI requires more than comprehension and reasoning. A crucial additional capability is effective planning in diverse scenarios, which involves making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life. It is constructed through a semi-automatic process utilizing egocentric videos, complemented by manual verification. Grounded in a first-person perspective, it mirrors the way humans approach problem-solving in everyday life. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. To further improve the planning proficiency of current MLLMs, we investigate the effectiveness of various multimodal prompts in complex planning and propose a training-free approach based on multimodal Chain-of-Thought (CoT) prompting. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training. Our work not only sheds light on the current limitations of MLLMs in planning, but also provides insights for future enhancements in this critical area. We have made data and code available at https://qiulu66.github.io/egoplanbench2/.
