Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, language, and acoustic cues. However, most existing research assumes that all modalities are available during both training and testing, which makes the resulting algorithms susceptible to missing-modality scenarios. In this paper, we propose a novel knowledge-transfer network that translates between different modalities to reconstruct the missing audio features. Moreover, we develop a cross-modality attention mechanism to maximize the information extracted from the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets show that our method significantly improves over baseline methods and achieves results comparable to those of previous methods trained with complete multi-modality supervision.
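To make the idea of attending across modalities concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which features of an observed modality (here, text) query features of a reconstructed modality (here, audio). It is an illustration under stated assumptions, not the architecture proposed in this paper: the function name `cross_modal_attention`, the feature dimensions, and the random projection matrices are all hypothetical, and the random inputs merely stand in for real features.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    query_feats:   (n_q, d_q) features of the observed modality
    context_feats: (n_c, d_c) features of the reconstructed modality
    w_q, w_k, w_v: projections into a shared d-dimensional space
    Returns the fused features (n_q, d) and the attention weights (n_q, n_c).
    """
    q = matmul(query_feats, w_q)
    k = matmul(context_feats, w_k)
    v = matmul(context_feats, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to (d, n_c)
    scores = matmul(q, k_t)
    # Each query step gets a distribution over the context steps.
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Gaussian placeholder matrix; stands in for learned or extracted values."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
text = random_matrix(5, 16, rng)      # assumed: 5 text steps, 16-dim features
audio_hat = random_matrix(7, 8, rng)  # assumed: 7 reconstructed audio steps, 8-dim
w_q = random_matrix(16, 12, rng)      # assumed shared attention dimension: 12
w_k = random_matrix(8, 12, rng)
w_v = random_matrix(8, 12, rng)

fused, weights = cross_modal_attention(text, audio_hat, w_q, w_k, w_v)
```

In this sketch each text step receives a weighted summary of the reconstructed audio sequence, and each row of `weights` sums to one. In practice the projections would be learned jointly with the sentiment predictor rather than sampled at random.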