To address the performance limitation in multimodal emotion recognition (MER) arising from insufficient inter-modal information fusion, we propose a novel MER framework based on multitask learning in which fusion occurs after alignment, called Foal-Net. The framework is designed to enhance the effectiveness of modality fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL aligns the emotional information in audio and video representations through contrastive learning. Then, a modal fusion network integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, assisting modal information fusion and guiding the model to focus more on emotional information. Experimental results on the IEMOCAP corpus show that Foal-Net outperforms state-of-the-art methods and that emotion alignment is necessary before modal fusion.
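The contrastive alignment step described above can be illustrated with a minimal NumPy sketch of a symmetric InfoNCE-style objective over paired audio and video embeddings. This is an illustrative assumption of how AVEL-style alignment is commonly formulated, not the paper's exact loss; the function name, temperature value, and symmetric form are hypothetical.

```python
import numpy as np

def contrastive_alignment_loss(audio, video, tau=0.07):
    """InfoNCE-style loss: matched audio/video pairs (same row index)
    are pulled together, mismatched pairs pushed apart."""
    # L2-normalize each embedding so the dot product is cosine similarity
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    logits = a @ v.T / tau  # pairwise similarity matrix, scaled by temperature

    idx = np.arange(len(a))  # positives sit on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # symmetric: audio-to-video and video-to-audio directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

After minimizing such a loss, the audio and video features for the same utterance occupy nearby regions of the shared space, which is the precondition the abstract argues for before the fusion network combines them.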