In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning, aiming to shed light on their capabilities and limitations in this domain. To this end, based on the characteristics of embodied task planning, we first develop a systematic evaluation framework that encapsulates four crucial capabilities of MFMs: object understanding, spatio-temporal perception, task understanding, and embodied reasoning. Following this, we propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios, typical yet diverse task types, task instances of varying difficulty, and rich test case types ranging from multiple embodied question answering to embodied task reasoning. Finally, we offer a simple and easy-to-use automatic evaluation platform that enables automated testing of multiple MFMs on the proposed benchmark. Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance. MFE-ETP is a high-quality, large-scale, and challenging benchmark relevant to real-world tasks.