Multimodal emotion recognition is an active research topic in artificial intelligence. It aims to integrate multimodal cues (acoustic, visual, and lexical) and recognize human emotional states from them. Current works generally assume that the emotion labels of benchmark datasets are correct and focus on building more effective architectures to achieve better performance. However, because emotion is ambiguous and subjective, existing datasets cannot achieve high annotation consistency (i.e., labels may be inaccurate), making it difficult for models developed on these datasets to meet the demands of practical applications. The key to addressing this problem is to improve the reliability of emotion annotations. Therefore, we propose a new task called ``Explainable Multimodal Emotion Reasoning (EMER)''. Unlike previous works that only predict emotional states, EMER further explains the reasons behind these predictions to enhance their reliability. In this task, rationality is the only evaluation metric: as long as the emotional reasoning process for a given video is plausible, the prediction is considered correct. In this paper, we make an initial attempt at this task and establish a benchmark dataset, baselines, and evaluation metrics. We aim to address the long-standing problem of label ambiguity and point the way toward next-generation affective computing techniques. In addition, EMER can be used to evaluate the audio-video-text understanding ability of recent multimodal large language models. Code and data: https://github.com/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning.