Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning, but it was restricted by its reliance on explicit face detectors, implicit fusion strategies, and limited-scale, low-quality training data. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, which together establish an end-to-end pipeline and a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues through richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module enables simultaneous local and global multimodal feature interactions outside the LLM backbone. Third, a perception-to-cognition curriculum instruction-tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated through a multi-agent pipeline involving Qwen2-Audio, Qwen2.5-VL, and GPT-4o, yielding 130k training clips and 36k testing clips across 18 evaluation benchmarks.
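The abstract does not spell out the schema of the unified multimodal instruction format, so the following is only an illustrative sketch: a hypothetical JSON-lines record that pairs a clip's video and audio paths with an instruction, a discrete emotion label, and a free-form reasoning target. All field names and values here are assumptions for illustration, not the actual MMEVerse schema.

```python
# Hypothetical example of a single MMEVerse-style instruction record.
# Field names, paths, and contents are illustrative assumptions only.
import json

sample = {
    "id": "meld_dia0_utt3",                      # source dataset + clip id (hypothetical)
    "video": "clips/meld/dia0_utt3.mp4",         # raw video clip path
    "audio": "clips/meld/dia0_utt3.wav",         # extracted audio track
    "instruction": "Describe the speaker's emotion and explain the "
                   "audio-visual cues that support your answer.",
    "label": "anger",                            # recognition target
    "reasoning": "The speaker raises her voice, furrows her brows, and "
                 "uses confrontational wording, indicating anger.",
}

print(json.dumps(sample, indent=2, ensure_ascii=False))
```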
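To make the pre-fusion idea concrete, the sketch below shows one common way such a module could combine local and global interactions: a depthwise convolution over concatenated modality tokens for local structure, plus multi-head self-attention for global cross-modal mixing, applied before the tokens enter the LLM backbone. This is a minimal sketch under those assumptions; the class name, dimensions, and exact composition are not taken from the paper.

```python
# Hypothetical sketch of a Conv + Attention pre-fusion block operating on
# concatenated audio, visual, and text tokens outside the LLM backbone.
# Architecture details are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class ConvAttentionPreFusion(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 8, kernel_size: int = 3):
        super().__init__()
        # Depthwise 1D convolution captures local cross-token structure.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size,
                                    padding=kernel_size // 2, groups=dim)
        # Multi-head self-attention captures global cross-modal interactions.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens, text_tokens):
        # Concatenate modality token sequences along the sequence axis.
        x = torch.cat([audio_tokens, visual_tokens, text_tokens], dim=1)
        # Local branch: Conv1d expects (batch, channels, seq_len).
        local = self.local_conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + local)
        # Global branch: full self-attention across all modality tokens.
        global_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm2(x + global_out)
        return x  # fused tokens, ready to be projected into the LLM

if __name__ == "__main__":
    fuse = ConvAttentionPreFusion(dim=64, heads=4)
    a = torch.randn(2, 10, 64)   # audio tokens
    v = torch.randn(2, 20, 64)   # visual / multiview tokens
    t = torch.randn(2, 8, 64)    # text tokens
    print(fuse(a, v, t).shape)   # torch.Size([2, 38, 64])
```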