Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as two questions remain open: (a) how well MLLMs generalize across distinct scenarios, and (b) whether they can reason about the triggering factors behind emotional states. To bridge these gaps, we present \textbf{MME-Emotion}, a systematic benchmark that assesses both the emotional understanding and reasoning capabilities of MLLMs, featuring \textit{scalable capacity}, \textit{diverse settings}, and \textit{unified protocols}. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: \ding{182} Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only a $39.3\%$ recognition score and a $56.0\%$ Chain-of-Thought (CoT) score on our benchmark. \ding{183} Generalist models (\emph{e.g.}, Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, whereas specialist models (\emph{e.g.}, R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope to lay a foundation for advancing the emotional intelligence of MLLMs.