Explainable Multimodal Emotion Recognition (EMER) is an emerging task that aims to achieve reliable and accurate emotion recognition. However, because annotation is expensive, the existing dataset (denoted EMER-Fine) is small, which makes supervised training difficult. To reduce annotation cost and expand the dataset, we revisit the previous dataset construction process, simplify the annotation pipeline, remove manual checks, and replace closed-source models with open-source ones. The result is \textbf{EMER-Coarse}, a large-scale, coarsely labeled dataset. In addition to the dataset, we propose \textbf{AffectGPT}, a two-stage training framework. The first stage uses EMER-Coarse to learn a coarse mapping between multimodal inputs and emotion-related descriptions; the second stage uses EMER-Fine to better align the model with manually checked results. Experimental results demonstrate the effectiveness of the proposed method on the challenging EMER task. To facilitate further research, we will make the code and dataset available at: https://github.com/zeroQiaoba/AffectGPT.
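The coarse-then-fine training schedule described in the abstract can be illustrated with a toy sketch. This is not the AffectGPT implementation: a 1-D linear model stands in for the multimodal model, synthetic noisy labels stand in for EMER-Coarse, and a small clean subset stands in for EMER-Fine. All names (`sgd_epochs`, `two_stage_train`, `TRUE_W`) and the learning rates and epoch counts are assumptions chosen only for illustration.

```python
import random


def sgd_epochs(w, samples, lr, epochs):
    """Run plain SGD on squared error for a 1-D linear model y = w * x."""
    for _ in range(epochs):
        for x, y in samples:
            grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w -= lr * grad
    return w


def two_stage_train(coarse, fine, w0=0.0):
    """Hypothetical two-stage schedule: coarse pre-training, then fine alignment."""
    # Stage 1: many noisy (coarse) samples give a rough initial mapping.
    w_coarse = sgd_epochs(w0, coarse, lr=0.05, epochs=5)
    # Stage 2: few clean (fine) samples refine it with a smaller step size.
    w_fine = sgd_epochs(w_coarse, fine, lr=0.01, epochs=50)
    return w_coarse, w_fine


# Synthetic stand-ins for the two datasets (assumed, not from the paper).
rng = random.Random(0)
TRUE_W = 2.0
xs = [rng.uniform(-1.0, 1.0) for _ in range(500)]
coarse = [(x, TRUE_W * x + rng.gauss(0.0, 0.5)) for x in xs]  # large, noisy labels
fine = [(x, TRUE_W * x) for x in xs[:20]]                     # small, clean labels

w_coarse, w_fine = two_stage_train(coarse, fine)
print(f"after stage 1: w = {w_coarse:.3f}; after stage 2: w = {w_fine:.3f}")
```

In this toy setting, the second stage can only move the parameter closer to the value implied by the clean labels, which mirrors the intent of aligning a coarsely trained model with manually checked annotations.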