Multi-modal semantic understanding requires integrating information from different modalities to uncover the real intention behind a user's words. Most previous work applies a dual-encoder structure that encodes image and text separately but does not learn cross-modal feature alignment, which makes deep cross-modal interaction hard to achieve. This paper proposes a novel CLIP-guided, contrastive-learning-based architecture for multi-modal feature alignment, which projects the features derived from different modalities into a unified deep space. On multi-modal sarcasm detection (MMSD) and multi-modal sentiment analysis (MMSA) tasks, experimental results show that the proposed model significantly outperforms several baselines, and that the feature alignment strategy brings clear performance gains over models with different aggregation methods and even over models enriched with knowledge. More importantly, the model is simple to implement and uses no task-specific external knowledge, so it can easily be migrated to other multi-modal tasks. Our source code is available at https://github.com/ChangKe123/CLFA.
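To make the alignment idea concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss that pulls matched text and image features together in a shared space. It is an illustration under stated assumptions, not necessarily the exact objective used in the paper: the function names, the temperature value of 0.07, and the use of plain Python lists in place of encoder outputs are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_alignment_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image features.

    Assumption: text_feats[i] and image_feats[i] come from the same sample
    and have already been projected to the same dimensionality.
    """
    texts = [l2_normalize(t) for t in text_feats]
    images = [l2_normalize(v) for v in image_feats]
    n = len(texts)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        for t in texts
    ]

    # Text-to-image direction: each text should select its own image.
    t2i = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Image-to-text direction: each image should select its own text.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    i2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (t2i + i2t)


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    image = [[0.9, 0.1], [0.1, 0.9]]
    print(contrastive_alignment_loss(text, image))
```

Minimizing a loss of this kind drives matched pairs toward high similarity and mismatched pairs toward low similarity, which is one common way to realize a "unified" feature space. In a CLIP-guided setup, one side of the pairing could plausibly be CLIP features acting as the alignment target, but that detail is an assumption here.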