XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning for two main reasons: 1) general-purpose modality encoders are pretrained to capture global structure and general semantics rather than fine-grained emotional cues, which limits their sensitivity to emotional signals; and 2) available datasets usually trade annotation quality against scale, which leaves emotional cues insufficiently supervised and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules, the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts emotional cues and matches them using semantic similarity, and we release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
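
The abstract describes EmoCue-360 only as a metric that extracts emotional cues and matches them by semantic similarity; it does not specify how. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the bag-of-words `embed` function (a toy stand-in for a real sentence encoder), the greedy one-to-one matching in `match_cues`, the similarity `threshold`, and the precision/recall/F1 summary in `cue_scores`. Cue extraction is skipped; cues are assumed to be given as short phrases.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_sq = sum(v * v for v in a.values()) * sum(v * v for v in b.values())
    return dot / math.sqrt(norm_sq) if norm_sq else 0.0


def match_cues(predicted, reference, threshold=0.5):
    """Greedily pair predicted and reference cues, most similar pairs first.

    Each cue is used at most once; pairs scoring below `threshold`
    (a hypothetical cutoff) are left unmatched.
    """
    pairs = sorted(
        (
            (cosine(embed(p), embed(r)), i, j)
            for i, p in enumerate(predicted)
            for j, r in enumerate(reference)
        ),
        reverse=True,
    )
    used_pred, used_ref, matches = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break  # pairs are sorted, so all remaining ones are weaker
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        matches.append((predicted[i], reference[j], score))
    return matches


def cue_scores(predicted, reference, threshold=0.5):
    """Summarize matching as (precision, recall, F1) over cues."""
    matched = len(match_cues(predicted, reference, threshold))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = ["furrowed brow", "raised voice", "smiling"]
    reference = ["furrowed brow", "trembling voice"]
    print(cue_scores(predicted, reference, threshold=0.4))
```

In practice the word-overlap `embed` would be replaced by a learned text encoder, since paraphrased cues such as "frowning" and "furrowed brow" share no tokens and would go unmatched here.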
