Multimodal emotion recognition (MMER) is an active research field that aims to recognize human emotions accurately by fusing multiple perceptual modalities. However, the inherent heterogeneity across modalities introduces distribution gaps and information redundancy, which pose significant challenges for MMER. In this paper, we propose a novel fine-grained disentangled representation learning (FDRL) framework to address these challenges. Specifically, we design modality-shared and modality-private encoders that project each modality into a modality-shared subspace and a modality-private subspace, respectively. In the shared subspace, we introduce a fine-grained alignment component to learn modality-shared representations and thus capture cross-modal consistency. In the private subspaces, we tailor a fine-grained disparity component that constrains them, thereby learning modality-private representations and enhancing their diversity. Lastly, we introduce a fine-grained predictor component to ensure that the representations output by the encoders retain their original emotion labels. Experimental results on the IEMOCAP dataset show that FDRL outperforms state-of-the-art methods, achieving 78.34% weighted average recall (WAR) and 79.44% unweighted average recall (UAR).
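For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how WAR and UAR are commonly computed, assuming the usual definitions from the emotion recognition literature: WAR weights each class's recall by its share of the samples (which equals overall accuracy), while UAR is the plain mean of per-class recall. The function name `war_uar` and the toy labels are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) as fractions in [0, 1].

    Assumed definitions (not taken from the paper):
      WAR = per-class recall weighted by class support (equals overall accuracy)
      UAR = unweighted mean of per-class recall
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = defaultdict(int)  # number of samples per true class
    correct = defaultdict(int)  # number of correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            correct[t] += 1

    recalls = {c: correct[c] / support[c] for c in support}
    war = sum(recalls[c] * support[c] for c in support) / len(y_true)
    uar = sum(recalls.values()) / len(recalls)
    return war, uar


# Toy, imbalanced example (illustrative labels only).
y_true = ["angry", "angry", "angry", "angry", "happy", "happy", "sad", "neutral"]
y_pred = ["angry", "angry", "angry", "happy", "happy", "sad", "sad", "neutral"]
war, uar = war_uar(y_true, y_pred)
print(f"WAR = {war:.4f}, UAR = {uar:.4f}")
```

In this toy case the smaller classes are recognized better than the largest one, so UAR comes out higher than WAR. The same ordering appears in the figures reported above, although this sketch says nothing about why it occurs there.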