Recent Multimodal Large Language Models (MLLMs) have made remarkable advances in perception and reasoning, suggesting their potential for embodied intelligence. While recent studies have evaluated embodied MLLMs in interactive settings, current benchmarks mainly target the ability to perceive, understand, and interact with external objects, and lack a systematic evaluation of self-centric intelligence. To address this gap, we introduce MirrorBench, a simulation-based benchmark inspired by the classical Mirror Self-Recognition (MSR) test in psychology. MirrorBench extends this paradigm to embodied MLLMs through a tiered framework of progressively challenging tasks, assessing agents from basic visual perception to high-level self-representation. Experiments on leading MLLMs show that even at the lowest level, their performance remains substantially below human performance, revealing fundamental limitations in self-referential understanding. Our study bridges psychological paradigms and embodied intelligence, offering a principled framework for evaluating the emergence of general intelligence in large models. Project page: https://fflahm.github.io/mirror-bench-page/.