Multimodal emotion recognition (MER) aims to detect the emotional state expressed in a given utterance by combining speech and text information. Intuitively, label information should help the model locate the salient tokens/frames relevant to a specific emotion, which in turn facilitates the MER task. Motivated by this, we propose a novel approach to MER that leverages label information. Specifically, we first obtain representative label embeddings for both the text and speech modalities, then learn label-enhanced text/speech representations for each utterance via label-token and label-frame interactions. Finally, we devise a novel label-guided attentive fusion module to fuse the label-aware text and speech representations for emotion classification. Extensive experiments on the public IEMOCAP dataset demonstrate that our proposed approach outperforms existing baselines and achieves new state-of-the-art performance.
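To make the label-token/label-frame interaction and the label-guided fusion more concrete, the following is a minimal sketch under stated assumptions, not the method's actual implementation. The abstract does not specify the scoring function or the fusion rule, so scaled dot-product attention and a two-way softmax gate are assumed here purely for illustration. The function names (`label_aware_pooling`, `label_guided_fusion`) and all toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def label_aware_pooling(units, label_embeddings):
    """Assumed label-token (or label-frame) interaction.

    units: token or frame vectors of one utterance.
    label_embeddings: one vector per emotion label, same dimension.
    Returns one label-aware vector per label, plus the attention weights.
    """
    dim = len(units[0])
    scale = math.sqrt(dim)
    pooled, weights = [], []
    for label in label_embeddings:
        # Each label scores every token/frame, then pools them by attention.
        attn = softmax([dot(label, u) / scale for u in units])
        pooled.append([sum(w * u[d] for w, u in zip(attn, units))
                       for d in range(dim)])
        weights.append(attn)
    return pooled, weights

def label_guided_fusion(text_vecs, speech_vecs, text_labels, speech_labels):
    """Assumed fusion rule: per label, gate the two modalities by how well
    each label-aware vector matches its own label embedding."""
    fused = []
    for t, s, lt, ls in zip(text_vecs, speech_vecs, text_labels, speech_labels):
        gate = softmax([dot(lt, t), dot(ls, s)])
        fused.append([gate[0] * a + gate[1] * b for a, b in zip(t, s)])
    return fused

# Toy 2-dimensional inputs (hypothetical values, two emotion labels).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # text token vectors
frames = [[0.2, 0.8], [0.9, 0.1]]               # speech frame vectors
text_labels = [[1.0, 0.0], [0.0, 1.0]]          # label embeddings, text side
speech_labels = [[0.8, 0.2], [0.1, 0.9]]        # label embeddings, speech side

text_vecs, text_attn = label_aware_pooling(tokens, text_labels)
speech_vecs, speech_attn = label_aware_pooling(frames, speech_labels)
fused = label_guided_fusion(text_vecs, speech_vecs, text_labels, speech_labels)
```

In this sketch, each label attends most strongly to the tokens or frames that align with its embedding, and the fused vector for each label is a convex combination of the text and speech vectors. A classifier over the fused vectors would follow in a full model; it is omitted here because the abstract gives no details about it.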