In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning, aiming to shed light on their capabilities and limitations in this domain. To this end, based on the characteristics of embodied task planning, we first develop a systematic evaluation framework that encapsulates four crucial capabilities of MFMs: object understanding, spatio-temporal perception, task understanding, and embodied reasoning. Following this, we propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios, typical yet diverse task types, task instances of varying difficulty, and rich test case types ranging from multiple embodied question answering to embodied task reasoning. Finally, we offer a simple and easy-to-use automatic evaluation platform that enables automated testing of multiple MFMs on the proposed benchmark. Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance. MFE-ETP is a high-quality, large-scale, and challenging benchmark relevant to real-world tasks.