Explainable Multimodal Emotion Reasoning

Multimodal emotion recognition is an active research topic in artificial intelligence. Its main goal is to integrate multiple modalities to identify human emotional states. Current work generally assumes that the emotion labels in benchmark datasets are accurate and focuses on developing more effective architectures. However, emotions are inherently ambiguous and subjective. To obtain more reliable labels, existing datasets usually restrict the label space to a few basic categories, hire multiple annotators, and use majority voting to select the most likely label. This process can cause labels that are correct, but that fall outside the candidate set or outside the majority, to be ignored. To improve reliability without ignoring subtle emotions, we propose a new task called "Explainable Multimodal Emotion Reasoning (EMER)". Whereas traditional tasks focus on predicting emotions, EMER goes a step further and provides explanations for these predictions. Through this task, we can extract more reliable labels, since each label has a stated basis. Meanwhile, we use LLMs to disambiguate unimodal descriptions and generate more complete multimodal EMER descriptions. From these descriptions, we can extract more subtle labels, which offers a promising approach to open-vocabulary emotion recognition. This paper presents our initial efforts: we introduce a new dataset, establish baselines, and define evaluation metrics. EMER can also serve as a benchmark for evaluating the audio-video-text understanding capabilities of multimodal LLMs. To facilitate further research, we will make the code and data available at: https://github.com/zeroQiaoba/AffectGPT.
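As a minimal illustration of the labeling issue described above, the following Python sketch shows how a fixed candidate set combined with majority voting discards both out-of-vocabulary and minority labels. The function name, candidate set, and annotations are hypothetical and are not taken from the EMER dataset or code.

```python
from collections import Counter

# Hypothetical closed label space; real datasets define their own categories.
CANDIDATES = {"happy", "sad", "angry", "surprised", "neutral"}


def majority_vote(annotations, candidates):
    """Pick the most frequent in-vocabulary label and report what was discarded.

    Returns (winner, dropped_minority, dropped_out_of_vocab).
    The winner is None if no annotation falls inside the candidate set.
    Ties are broken arbitrarily (first label seen with the top count).
    """
    kept = [a for a in annotations if a in candidates]
    dropped_out_of_vocab = [a for a in annotations if a not in candidates]
    if not kept:
        return None, [], dropped_out_of_vocab
    counts = Counter(kept)
    winner, _ = counts.most_common(1)[0]
    dropped_minority = sorted(label for label in counts if label != winner)
    return winner, dropped_minority, dropped_out_of_vocab


# Hypothetical annotations for a single clip from five annotators.
annotations = ["happy", "happy", "surprised", "nostalgic", "happy"]
winner, dropped_minority, dropped_oov = majority_vote(annotations, CANDIDATES)
print(winner, dropped_minority, dropped_oov)
```

In this toy case the clip is labeled only "happy", while "surprised" (a minority label) and "nostalgic" (outside the candidate set) are lost, even though both may describe the clip correctly.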
