Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations of their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies reporting divergent results in zero-shot settings. We argue that this inconsistency stems partly from constraints in existing evaluation methods: the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotation. To enable customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through a systematic evaluation of prevailing MLLMs, our study shows that they perform strongly at emotion interpretation and context-based emotion judgment, while revealing relative weaknesses in comprehending the subjectivity of emotion perception. Compared to humans, even top-performing MLLMs such as GPT-4o exhibit notable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive assessment of MLLMs, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.