Multimodal Emotion Recognition in Conversations (MERC) aims to identify the emotions expressed across modalities such as text, audio, image, and video, and is an important research direction toward machine intelligence. However, MERC data often exhibit a naturally imbalanced distribution of emotion categories, and researchers often overlook the negative impact of this imbalance on emotion recognition. To tackle this problem, we systematically analyze it from three aspects: data augmentation, loss sensitivity, and sampling strategy, and propose the Class Boundary Enhanced Representation Learning (CBERL) model. Concretely, we first design a multimodal generative adversarial network to address the imbalanced distribution of emotion categories in the raw data. Second, we propose a deep joint variational autoencoder that fuses complementary semantic information across modalities and yields discriminative feature representations. Finally, we implement a multi-task graph neural network with mask reconstruction and classification optimization to address overfitting and underfitting in class boundary learning and to perform cross-modal emotion recognition. Extensive experiments on the IEMOCAP and MELD benchmark datasets show that CBERL improves emotion recognition performance. In particular, on the minority-class emotion labels fear and disgust, our model improves accuracy and F1 score by 10% to 20%.
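To make the loss-sensitivity aspect concrete, the following minimal sketch shows one common way to make a classification loss sensitive to class imbalance: weighting each class by its inverse frequency. It is not the CBERL loss and is not taken from the paper; the function names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical, and the label counts are toy values chosen only for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so that rare classes weigh more.

    N is the number of samples and K the number of distinct classes.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted mean of -log p(true class), normalised by total weight.

    probs  : list of dicts mapping class name -> predicted probability
    labels : list of true class names, aligned with probs
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Toy, made-up label distribution (assumption): one majority and two
# minority emotion classes.
labels = ["neutral"] * 6 + ["joy"] * 3 + ["fear"] * 1
weights = inverse_frequency_weights(labels)

# A classifier that predicts a uniform distribution for every utterance.
uniform = {"neutral": 1 / 3, "joy": 1 / 3, "fear": 1 / 3}
loss = weighted_cross_entropy([uniform] * len(labels), labels, weights)
```

With these weights, an error on a `fear` utterance contributes six times as much to the loss as an error on a `neutral` one, which is the kind of sensitivity that plain cross-entropy lacks on imbalanced data.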