With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has attracted increasing attention. Despite significant progress, two major challenges remain on the way to robust MSA: 1) inefficiency in modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to randomly missing modality features, which typically occur in realistic settings. In this paper, we propose a generic and unified framework to address both, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs utterance-level representations from each modality as the global multimodal context, which interacts with local unimodal features so that the two mutually promote each other. This not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also leads to better performance. To improve model robustness in the incomplete modality setting, DLFR, on the one hand, performs low-level feature reconstruction to implicitly encourage the model to learn semantic information from incomplete data. On the other hand, it regards complete and incomplete data as two different views of one sample and uses siamese representation learning to explicitly pull their high-level representations together. Comprehensive experiments on three popular datasets demonstrate that our method achieves superior performance in both complete and incomplete modality settings.
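To make the efficiency claim concrete, the following is a minimal counting sketch, not the paper's implementation. It compares how many attention scores are computed when every modality attends to every other modality's full sequence (local-local) with how many are computed when each local sequence interacts only with a small global context. The function names, the choice of one global token per modality, the bidirectional interaction, and the sequence lengths are all assumptions made for illustration.

```python
from itertools import permutations


def local_local_scores(lengths):
    """Attention scores when each modality attends to every other modality's
    full sequence: one T_query * T_key score matrix per ordered modality pair."""
    return sum(t_query * t_key for t_query, t_key in permutations(lengths, 2))


def global_local_scores(lengths):
    """Attention scores when each local sequence only exchanges information
    with a global context.

    Assumptions (not stated in the abstract): the global context holds one
    utterance-level token per modality, and the interaction runs in both
    directions (local -> global and global -> local).
    """
    num_global_tokens = len(lengths)
    return sum(2 * t_local * num_global_tokens for t_local in lengths)


# Hypothetical unaligned sequence lengths for text, audio, and vision.
lengths = [50, 375, 500]
doubled = [2 * t for t in lengths]

print("local-local :", local_local_scores(lengths), "->", local_local_scores(doubled))
print("global-local:", global_local_scores(lengths), "->", global_local_scores(doubled))
```

With these hypothetical lengths, the local-local count is 462,500 and the global-local count is 5,550. Doubling every sequence length quadruples the first count but only doubles the second, which is the quadratic versus linear scaling the abstract refers to.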
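The two restoration levels described in the abstract can be sketched as a combined objective. This is a hedged illustration in plain Python, not the authors' code: the abstract does not name the loss functions, so mean squared error on the missing time steps stands in for low-level reconstruction, and negative cosine similarity stands in for the siamese attraction between the complete and incomplete views. All function names, the mask convention, and the weights `w_low` and `w_high` are hypothetical. Siamese methods commonly add a stop-gradient and a predictor head, which are omitted here.

```python
import math


def mse_on_missing(original, reconstructed, missing_mask):
    """Mean squared error computed only over time steps marked as missing.

    original, reconstructed: lists of feature vectors, one per time step.
    missing_mask: list of 0/1 flags, 1 meaning the step was dropped
    (a hypothetical convention).
    """
    errors = [
        (o - r) ** 2
        for orig_vec, rec_vec, is_missing in zip(original, reconstructed, missing_mask)
        if is_missing
        for o, r in zip(orig_vec, rec_vec)
    ]
    return sum(errors) / len(errors) if errors else 0.0


def negative_cosine(p, z):
    """Negative cosine similarity; minimizing it pulls p and z together."""
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return -dot / norm if norm > 0 else 0.0


def dual_level_loss(original, reconstructed, missing_mask,
                    h_complete, h_incomplete, w_low=1.0, w_high=1.0):
    """Weighted sum of a low-level and a high-level restoration term."""
    low = mse_on_missing(original, reconstructed, missing_mask)
    high = negative_cosine(h_incomplete, h_complete)
    return w_low * low + w_high * high


# Toy example: two time steps, the second one was missing in the input.
original = [[1.0, 2.0], [3.0, 4.0]]
reconstructed = [[1.0, 2.0], [2.0, 4.0]]
missing_mask = [0, 1]

loss = dual_level_loss(original, reconstructed, missing_mask,
                       h_complete=[1.0, 0.0], h_incomplete=[1.0, 0.0])
print(loss)
```

The low-level term only scores positions that were actually dropped, so the model is not rewarded for copying observed features. The high-level term reaches its minimum when the two views produce representations pointing in the same direction.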