Multimodal sentiment analysis uses multimodal signals to detect a speaker's sentiment. Previous approaches concentrate on multimodal fusion and representation learning based on general knowledge obtained from pretrained models, neglecting the effect of domain-specific knowledge. In this paper, we propose Contrastive Knowledge Injection (ConKI) for multimodal sentiment analysis, in which specific-knowledge representations for each modality are learned together with general-knowledge representations via knowledge injection based on an adapter architecture. In addition, ConKI applies hierarchical contrastive learning at three levels: between knowledge types within each modality, across modalities within each sample, and across samples. This procedure facilitates effective learning of the proposed representations and thereby improves multimodal sentiment prediction. Experiments on three popular multimodal sentiment analysis benchmarks show that ConKI outperforms all prior methods on a variety of performance metrics.
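To make the three contrast levels concrete, the following is a minimal sketch rather than the ConKI implementation. The abstract does not state which pairs count as positives, so the pairing rules below are assumptions made purely for illustration: general-knowledge representations of different modalities within one sample are treated as positives, representations of the same knowledge type from different samples with the same sentiment label are treated as positives, and every other pair (including general versus specific within a modality) acts as a negative. The function names `build_positive_pairs` and `contrastive_loss`, the temperature value, and the InfoNCE-style loss form are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_positive_pairs(meta, labels):
    """Pick positive index pairs under the ASSUMED rules described in the text.

    meta   : list of (sample_id, modality, knowledge_type) per representation,
             with knowledge_type either "general" or "specific".
    labels : dict mapping sample_id to a discrete sentiment label.
    """
    pairs = []
    for i in range(len(meta)):
        for j in range(i + 1, len(meta)):
            (si, mi, ki), (sj, mj, kj) = meta[i], meta[j]
            # Across modalities within a sample: general knowledge is shared (assumption).
            cross_modal = si == sj and mi != mj and ki == kj == "general"
            # Across samples: same knowledge type and same label (assumption).
            cross_sample = si != sj and ki == kj and labels[si] == labels[sj]
            if cross_modal or cross_sample:
                pairs.append((i, j))
    return pairs


def contrastive_loss(embeddings, positive_pairs, temperature=0.5):
    """InfoNCE-style loss averaged over anchors that have at least one positive.

    For anchor i, each positive p contributes
    -log( exp(sim(i, p) / t) / sum over k != i of exp(sim(i, k) / t) ).
    """
    n = len(embeddings)
    positives = {i: set() for i in range(n)}
    for i, j in positive_pairs:
        positives[i].add(j)
        positives[j].add(i)

    total, anchors = 0.0, 0
    for i in range(n):
        if not positives[i]:
            continue
        denom = sum(
            math.exp(cosine(embeddings[i], embeddings[k]) / temperature)
            for k in range(n)
            if k != i
        )
        for p in positives[i]:
            numer = math.exp(cosine(embeddings[i], embeddings[p]) / temperature)
            total += -math.log(numer / denom) / len(positives[i])
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    # One sample with text and audio, each split into general and specific parts.
    meta = [
        (0, "text", "general"),
        (0, "text", "specific"),
        (0, "audio", "general"),
        (0, "audio", "specific"),
    ]
    vectors = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [-1.0, 0.3]]
    pairs = build_positive_pairs(meta, labels={0: 1})
    print(pairs, contrastive_loss(vectors, pairs))
```

Minimizing such a loss pulls the assumed positive pairs together and pushes all remaining pairs apart, which is one way to encourage the general and specific representations of a modality to encode different information. In a real model the vectors would come from the pretrained backbones and the adapters, and the loss would be combined with the sentiment prediction objective.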