Multimodal emotion recognition is an important research topic in artificial intelligence; its main goal is to integrate multimodal clues to identify human emotional states. Current works generally assume that benchmark datasets have accurate labels and focus on developing more effective architectures. However, emotion annotation relies on subjective judgment. To obtain more reliable labels, existing datasets usually restrict the label space to a few basic categories, then recruit many annotators and use majority voting to select the most likely label. As a result, labels that are correct but fall outside the candidate set or the majority vote may be ignored. To ensure reliability without ignoring subtle emotions, we propose a new task called "Explainable Multimodal Emotion Recognition (EMER)". Unlike traditional emotion recognition, EMER goes a step further by providing explanations for its predictions. Through this task, we can extract relatively reliable labels, since each label has an explicit basis. Meanwhile, we use large language models (LLMs) to disambiguate unimodal clues and generate more complete multimodal explanations. From these explanations, we can extract richer emotions in an open-vocabulary manner. This paper presents our initial attempt at this task, including introducing a new dataset, establishing baselines, and defining evaluation metrics. In addition, EMER can serve as a benchmark task for evaluating the audio-video-text understanding performance of multimodal LLMs.
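The label-loss problem described above can be illustrated with a minimal sketch of closed-set majority voting. The candidate label set, the function name `majority_vote`, and the example annotations are all hypothetical and are not taken from any specific dataset or from the paper's pipeline.

```python
from collections import Counter

# Hypothetical closed label space; real datasets define their own.
CANDIDATE_LABELS = {"happy", "sad", "angry", "neutral", "surprised"}


def majority_vote(annotations, candidates=CANDIDATE_LABELS):
    """Return (winner, non_candidate, non_majority) for one sample.

    winner        -- the most frequent in-space label, or None if no votes remain
    non_candidate -- labels dropped because they are outside the label space
    non_majority  -- in-space labels dropped because they lost the vote
    """
    # Labels outside the candidate set never reach the vote.
    in_space = [a for a in annotations if a in candidates]
    non_candidate = {a for a in annotations if a not in candidates}

    counts = Counter(in_space)
    if not counts:
        return None, non_candidate, set()

    winner, _ = counts.most_common(1)[0]
    non_majority = set(counts) - {winner}
    return winner, non_candidate, non_majority


# Illustrative annotations from five hypothetical annotators.
votes = ["sad", "sad", "sad", "angry", "anxious"]
winner, non_candidate, non_majority = majority_vote(votes)
print(winner, non_candidate, non_majority)
```

In this example the sample ends up labeled only as "sad": "anxious" is discarded because it is not a candidate label, and "angry" is discarded because it is not the majority, even though both may be valid descriptions of the sample.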