Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to discrepancies across modalities and the difficulty of characterizing unimodal emotional information. To address these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, an adversarial autoencoder (AAE) is constructed separately for each modality. Each AAE learns discriminative emotion features and reconstructs them through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAEs of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish emotional relationships between interacting modalities, and generate interaction features across modalities. Finally, a Feature Fusion Module (FFM) fuses the multimodal features for better emotion recognition. Experiments conducted on the publicly available SIMS and MOSI datasets demonstrate that MCIHN achieves superior performance.
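The pipeline described above can be summarized as: per-modality AAE encoders produce latent codes, a cross-modal gate combines pairs of codes into interaction features, and a fusion module classifies the concatenated result. The following is a minimal PyTorch sketch of that flow; the module structures, feature dimensions, and gating/fusion formulas are illustrative assumptions, and the adversarial training and decoder reconstruction of the AAEs are omitted.

```python
# Minimal sketch of the MCIHN forward pass (assumptions only; not the paper's code).
import torch
import torch.nn as nn


class AAEEncoder(nn.Module):
    """Encoder half of a per-modality adversarial autoencoder (latent code only)."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, x):
        return self.net(x)


class CrossModalGate(nn.Module):
    """Assumed gate mechanism: a sigmoid gate blends two modalities' latent codes."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, z_a, z_b):
        g = torch.sigmoid(self.gate(torch.cat([z_a, z_b], dim=-1)))
        return g * z_a + (1.0 - g) * z_b  # interaction feature


class FeatureFusion(nn.Module):
    """Assumed fusion head over the concatenated interaction features."""
    def __init__(self, latent_dim: int, num_interactions: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim * num_interactions, 64), nn.ReLU(),
            nn.Linear(64, num_classes))

    def forward(self, feats):
        return self.classifier(torch.cat(feats, dim=-1))


if __name__ == "__main__":
    B, latent = 4, 32
    # Input dimensions are placeholders for text/audio/visual features.
    enc_t = AAEEncoder(in_dim=768, latent_dim=latent)
    enc_a = AAEEncoder(in_dim=74, latent_dim=latent)
    enc_v = AAEEncoder(in_dim=35, latent_dim=latent)
    gate = CrossModalGate(latent)
    fuse = FeatureFusion(latent, num_interactions=3, num_classes=3)

    z_t = enc_t(torch.randn(B, 768))
    z_a = enc_a(torch.randn(B, 74))
    z_v = enc_v(torch.randn(B, 35))
    # Pairwise cross-modal interaction paths, then fusion for classification.
    interactions = [gate(z_t, z_a), gate(z_t, z_v), gate(z_a, z_v)]
    logits = fuse(interactions)
    print(logits.shape)  # torch.Size([4, 3])
```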