Emotion Recognition in Conversation (ERC) plays an important role in advancing human-machine interaction. Emotions can be expressed in multiple modalities, and multimodal ERC faces two main problems: (1) noise introduced during cross-modal information fusion, and (2) the difficulty of predicting emotion labels that have few samples and are semantically similar to, yet categorically distinct from, other labels. To address these issues and fully utilize the features of each modality, we adopt the following strategies. First, we extract deep emotion cues from modalities with strong representational ability and design feature filters that serve as multimodal prompt information for modalities with weak representational ability. Second, we design a Multimodal Prompt Transformer (MPT) to perform cross-modal information fusion. MPT embeds multimodal fusion information into each attention layer of the Transformer, allowing the prompt information to participate in encoding textual features and to be fused with multi-level textual information, which yields better multimodal fusion features. Finally, we use a Hybrid Contrastive Learning (HCL) strategy to improve the model's ability to handle labels with few samples. This strategy uses unsupervised contrastive learning to improve the representational ability of the multimodal fusion and supervised contrastive learning to mine the information in labels with few samples. Experimental results show that our proposed model outperforms state-of-the-art ERC models on two benchmark datasets.
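The abstract does not give the form of the HCL objective, so the following is only a minimal sketch of how an unsupervised and a supervised contrastive term might be combined. It assumes an InfoNCE-style loss over cosine similarities, two embedding "views" per utterance, and a simple weighted sum; the function names `contrastive_loss` and `hybrid_contrastive_loss` and the parameters `alpha` and `temperature` are hypothetical and do not come from the paper.

```python
import math


def _normalize(v):
    # L2-normalize a vector so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(features, positives, temperature=0.5):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    features:  list of embedding vectors, one per anchor.
    positives: positives[i] is the set of indices treated as positives
               for anchor i; all other indices act as negatives.
    """
    z = [_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        if not positives[i]:
            continue  # an anchor without positives contributes nothing
        logits = {j: _dot(z[i], z[j]) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp over every other sample.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Negative mean log-probability of this anchor's positives.
        total -= sum(logits[j] - log_denom for j in positives[i]) / len(positives[i])
        counted += 1
    return total / counted if counted else 0.0


def hybrid_contrastive_loss(view_a, view_b, labels, alpha=0.5, temperature=0.5):
    """Weighted sum of an unsupervised and a supervised contrastive term.

    view_a, view_b: two embeddings of the same n utterances.
    labels:         one emotion label per utterance.
    alpha:          hypothetical mixing weight between the two terms.
    """
    n = len(view_a)
    features = list(view_a) + list(view_b)
    all_labels = list(labels) + list(labels)
    # Unsupervised term: the only positive is the other view of the same utterance.
    unsup_pos = [{(i + n) % (2 * n)} for i in range(2 * n)]
    # Supervised term: every other embedding that shares the anchor's label.
    sup_pos = [{j for j in range(2 * n)
                if j != i and all_labels[j] == all_labels[i]}
               for i in range(2 * n)]
    l_unsup = contrastive_loss(features, unsup_pos, temperature)
    l_sup = contrastive_loss(features, sup_pos, temperature)
    return alpha * l_unsup + (1.0 - alpha) * l_sup


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]
    print(hybrid_contrastive_loss(anchors, aligned, labels=[0, 1]))
```

In this sketch the supervised term lets a rare label draw on every same-label utterance in the batch as a positive, which is one plausible reading of "mining the information of labels with few samples"; the actual weighting and view construction used by the authors may differ.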