Explainable Multimodal Emotion Reasoning

Multimodal emotion recognition is an active research topic in artificial intelligence. Its primary objective is to integrate multiple modalities (such as acoustic, visual, and lexical cues) to identify human emotional states. Current works generally assume that benchmark datasets have accurate emotion labels and focus on developing more effective architectures. However, due to the inherent subjectivity of emotions, existing datasets often lack high annotation consistency, resulting in potentially inaccurate labels. Consequently, models built on these datasets may struggle to meet the demands of practical applications. To address this issue, it is crucial to enhance the reliability of emotion annotations. In this paper, we propose a novel task called ``\textbf{Explainable Multimodal Emotion Reasoning (EMER)}''. In contrast to previous works that primarily focus on predicting emotions, EMER takes a step further by providing explanations for these predictions. A prediction is considered correct as long as the reasoning process behind the predicted emotion is plausible. This paper presents our initial efforts on EMER, where we introduce a benchmark dataset, establish baseline models, and define evaluation metrics. We also observe that EMER requires integrating multi-faceted capabilities. Therefore, we propose the first multimodal large language model (LLM) in affective computing, called \textbf{AffectGPT}. We aim to tackle the long-standing challenge of label ambiguity and chart a path toward more reliable techniques. Furthermore, EMER offers an opportunity to evaluate the audio-video-text understanding capabilities of recent multimodal LLMs. To facilitate further research, we make the code and data available at: https://github.com/zeroQiaoba/AffectGPT.
