Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks, but their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. This gap impedes their ability to effectively understand and respond to the intricate emotions that humans express through multimodal media. To bridge this gap, we introduce EmoBench, the first comprehensive benchmark designed specifically to evaluate the emotional capabilities of MLLMs across five popular emotional tasks, using a diverse dataset of 287k images and videos paired with corresponding textual instructions. We further propose EmoLLM, a novel model for multimodal emotional understanding that incorporates two core techniques: 1) Multi-perspective Visual Projection, which captures diverse emotional cues in visual data from multiple perspectives, and 2) EmoPrompt, which guides MLLMs to reason about emotions in the correct direction. Experimental results demonstrate that EmoLLM significantly improves multimodal emotional understanding, with an average gain of 12.1% across multiple foundation models on EmoBench. Our work contributes to the advancement of MLLMs by facilitating a deeper and more nuanced comprehension of intricate human emotions, paving the way for artificial emotional intelligence with wide-ranging applications in areas such as human-computer interaction, mental health support, and empathetic AI systems. Code, data, and models will be released.
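To make the first technique concrete, below is a minimal, hypothetical sketch of the Multi-perspective Visual Projection idea: several parallel projection heads map visual-encoder features into the LLM embedding space, each head intended to capture a different emotional "view" of the image or video. The module names, dimensions, and fusion-by-concatenation strategy are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-perspective visual projector (assumed design).
import torch
import torch.nn as nn


class MultiPerspectiveVisualProjection(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, num_perspectives: int = 3):
        super().__init__()
        # One lightweight MLP projector per "perspective" (illustrative choice).
        self.projectors = nn.ModuleList(
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_perspectives)
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vis_dim) from a frozen vision encoder.
        # Each projector produces its own sequence of visual tokens; the sequences
        # are concatenated along the token dimension before being fed to the LLM.
        views = [proj(visual_feats) for proj in self.projectors]
        return torch.cat(views, dim=1)  # (batch, num_perspectives * num_patches, llm_dim)


if __name__ == "__main__":
    feats = torch.randn(2, 256, 1024)            # dummy CLIP-like patch features
    tokens = MultiPerspectiveVisualProjection()(feats)
    print(tokens.shape)                          # torch.Size([2, 768, 4096])
```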