Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, linguistic, and acoustic cues. However, most existing research assumes that all modalities are available during both training and testing, leaving the resulting algorithms brittle in missing-modality scenarios. In this paper, we propose a novel knowledge-transfer network that translates between modalities to reconstruct the missing audio modality. Moreover, we develop a cross-modality attention mechanism that retains maximal information from the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets demonstrate significant improvements over baselines and results comparable to previous methods trained with complete multimodal supervision.
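The abstract names a cross-modality attention mechanism for fusing reconstructed and observed modality features but gives no implementation details, so the following is a minimal sketch only: it assumes a standard multi-head attention in which one modality's features query another's, and every name, dimension, and hyperparameter here is a hypothetical placeholder rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention: one modality's features
    query another's. All hyperparameters are assumptions; the paper's
    exact attention design is not specified in the abstract."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor,
                context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, seq_q, dim), e.g. observed text features
        # context_feats: (batch, seq_k, dim), e.g. reconstructed audio features
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        # Residual connection preserves information from the querying modality.
        return self.norm(query_feats + attended)

# Toy usage: hypothetical text features attend over reconstructed audio features.
text = torch.randn(2, 20, 128)
audio = torch.randn(2, 50, 128)
fused = CrossModalAttention()(text, audio)
print(fused.shape)  # torch.Size([2, 20, 128])
```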