Emotion recognition in conversations (ERC), the task of recognizing the emotion of each utterance in a conversation, is crucial for building empathetic machines. Existing studies focus mainly on capturing context- and speaker-sensitive dependencies in the textual modality and overlook the significance of multimodal information. Unlike emotion recognition in textual conversations, multimodal ERC depends on capturing intra- and inter-modal interactions between utterances, learning weights between different modalities, and enhancing modal representations. In this paper, we propose a transformer-based model with self-distillation (SDT) for the task. The model captures intra- and inter-modal interactions with intra- and inter-modal transformers, and dynamically learns weights between modalities through a hierarchical gated fusion strategy. Furthermore, to learn more expressive modal representations, we treat the soft labels of the proposed model as extra training supervision. Specifically, we introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality. Experiments on the IEMOCAP and MELD datasets demonstrate that SDT outperforms previous state-of-the-art baselines.
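To make the self-distillation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the fused model acts as the teacher and each modality-specific branch as a student, and that the per-modality loss combines a cross-entropy term on the hard label with a KL-divergence term on temperature-softened soft labels. The function names and the `temperature`, `alpha`, and `beta` parameters are hypothetical; the abstract does not specify the exact loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Hard-label term: negative log-probability of the gold emotion class."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(teacher_logits, student_logits, label,
                           temperature=2.0, alpha=1.0, beta=1.0):
    """Assumed per-modality loss: weighted hard-label CE plus soft-label KL.

    teacher_logits: logits of the fused (full) model for one utterance.
    student_logits: logits of a single-modality branch for the same utterance.
    """
    hard = cross_entropy(student_logits, label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + beta * soft


# Illustrative logits for one utterance over three emotion classes (made-up values).
fused = [2.0, 0.5, -1.0]
branches = {
    "text": [1.5, 0.7, -0.5],
    "audio": [0.2, 0.1, 0.0],
    "visual": [0.0, 0.3, -0.2],
}
total = sum(self_distillation_loss(fused, logits, label=0)
            for logits in branches.values())
print(f"summed distillation loss over modalities: {total:.4f}")
```

In this sketch the soft-label term vanishes when a branch already matches the fused model's distribution, so only branches that disagree with the fused prediction receive extra gradient signal.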