Visual-textual sentiment analysis aims to predict sentiment from a paired image and text input, which makes it challenging to learn effective features across diverse input images. To address this, we propose a holistic method that achieves robust visual-textual sentiment analysis by exploiting a rich set of powerful pre-trained visual and textual prior models. The method consists of four parts: (1) a visual-textual branch that learns features for sentiment analysis directly from the data, (2) a visual expert branch with a set of pre-trained "expert" encoders that extract selected semantic visual features, (3) a CLIP branch that implicitly models visual-textual correspondence, and (4) a BERT-based multimodal feature fusion network that fuses the multimodal features and makes the sentiment prediction. Extensive experiments on three datasets show that our method outperforms existing visual-textual sentiment analysis methods.
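To make the four-branch design concrete, below is a minimal PyTorch sketch of how such branches might be composed; it is an illustration under stated assumptions, not the paper's implementation. The feature dimensions, the number of expert encoders, and the token-concatenation fusion scheme are all hypothetical, and the pre-trained backbones, "expert" encoders, CLIP, and the BERT fusion network are stood in for by simple linear projections and a generic Transformer encoder.

```python
# Hypothetical sketch of the four-branch architecture described above.
# Encoder internals are stand-ins (linear projections); in the actual method
# these would be pre-trained visual/textual backbones, "expert" encoders,
# CLIP image/text encoders, and a BERT-based fusion network.
import torch
import torch.nn as nn


class HolisticSentimentModel(nn.Module):
    def __init__(self, d_model=768, num_experts=3, num_classes=3):
        super().__init__()
        # (1) visual-textual branch: features learned directly from data
        self.visual_proj = nn.Linear(2048, d_model)  # e.g. pooled CNN features
        self.text_proj = nn.Linear(768, d_model)     # e.g. text encoder features
        # (2) visual expert branch: one projection per pre-trained "expert"
        self.expert_proj = nn.ModuleList(
            nn.Linear(512, d_model) for _ in range(num_experts)
        )
        # (3) CLIP branch: image/text embeddings from a frozen CLIP model
        self.clip_img_proj = nn.Linear(512, d_model)
        self.clip_txt_proj = nn.Linear(512, d_model)
        # (4) fusion: a small Transformer encoder standing in for the
        # BERT-based fusion network, followed by a sentiment classifier
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, img_feat, txt_feat, expert_feats, clip_img, clip_txt):
        b = img_feat.size(0)
        # Project each branch's features into a shared space, one token each
        tokens = [
            self.visual_proj(img_feat),
            self.text_proj(txt_feat),
            self.clip_img_proj(clip_img),
            self.clip_txt_proj(clip_txt),
        ] + [proj(f) for proj, f in zip(self.expert_proj, expert_feats)]
        seq = torch.stack(tokens, dim=1)  # (B, num_tokens, d_model)
        # Prepend a learnable [CLS]-style token and fuse with self-attention
        seq = torch.cat([self.cls.expand(b, -1, -1), seq], dim=1)
        fused = self.fusion(seq)
        return self.head(fused[:, 0])     # sentiment logits from fused token


# Usage with random stand-in features for a batch of 2 image-text pairs
model = HolisticSentimentModel()
logits = model(
    torch.randn(2, 2048), torch.randn(2, 768),
    [torch.randn(2, 512) for _ in range(3)],
    torch.randn(2, 512), torch.randn(2, 512),
)
print(logits.shape)  # torch.Size([2, 3])
```

Treating each branch's output as a token and fusing them with self-attention is one plausible realization of "fusing multimodal features with BERT"; simpler alternatives such as concatenating the projected features before a classifier would also fit the description.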