Explainable Multimodal Emotion Reasoning

Multimodal emotion recognition is an active research topic in artificial intelligence. Its primary objective is to integrate multiple modalities (such as acoustic, visual, and lexical cues) to identify human emotional states. Existing work generally assumes that the emotion labels in benchmark datasets are accurate and focuses on developing more effective architectures. However, due to the inherent subjectivity of emotions, existing datasets often have low annotation consistency, which can result in inaccurate labels. Consequently, models built on these datasets may struggle to meet the demands of practical applications. Addressing this issue requires more reliable emotion annotations. In this paper, we propose a novel task called ``Explainable Multimodal Emotion Reasoning (EMER)''. In contrast to previous work that primarily focuses on predicting emotions, EMER goes a step further by also providing explanations for these predictions. A prediction is considered correct as long as the reasoning process behind the predicted emotion is plausible. This paper presents our initial efforts on EMER: we introduce a benchmark dataset, establish baseline models, and define evaluation metrics. We aim to tackle the long-standing challenge of label ambiguity and chart a path toward more reliable affective computing techniques. Furthermore, EMER offers an opportunity to evaluate the audio-video-text understanding capabilities of recent multimodal large language models. To facilitate further research, we make the code and data available at: https://github.com/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning.
