Data scarcity and the difficulty of multimodal fusion have long been challenges for multimodal emotion recognition (MER). In this paper, we propose using pretrained models as upstream networks, wav2vec 2.0 for the audio modality and BERT for the text modality, and fine-tuning them on the downstream MER task to cope with the lack of data. To address the difficulty of multimodal fusion, we use a K-layer multi-head attention mechanism as the downstream fusion module. Starting from the MER task itself, we design two auxiliary tasks to alleviate insufficient fusion between modalities and to guide the network to capture and align emotion-related features. Compared with previous state-of-the-art models, we achieve better performance, reaching 78.42% Weighted Accuracy (WA) and 79.71% Unweighted Accuracy (UA) on the IEMOCAP dataset.
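The abstract reports WA and UA without defining them. As a minimal sketch, assuming the convention common in the emotion recognition literature (WA is the overall accuracy over all utterances, UA is the mean of the per-class recalls), the two metrics can be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly.

    Assumed definition of WA; frequent classes contribute more.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every emotion class counts equally.

    Assumed definition of UA; insensitive to class imbalance.
    """
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy labels for illustration only (not IEMOCAP data).
y_true = ["ang", "ang", "ang", "ang", "hap", "hap", "neu", "sad"]
y_pred = ["ang", "ang", "ang", "hap", "hap", "neu", "neu", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")  # WA = 0.7500, UA = 0.8125
```

Under these definitions the two numbers diverge whenever the classes are imbalanced, which is why results on IEMOCAP are usually reported with both.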