This work studies how to exploit the CLIP embedding space to perform Visual Sentiment Analysis. We experiment with two architectures built on top of the CLIP embedding space, which we denote by CLIP-E. We train the CLIP-E models on WEBEmo, the largest publicly available, manually labeled benchmark for Visual Sentiment Analysis, and perform two sets of experiments. First, we test on WEBEmo and compare the CLIP-E architectures with state-of-the-art (SOTA) models and with CLIP Zero-Shot. Second, we perform cross-dataset evaluation, testing the CLIP-E architectures trained on WEBEmo on other Visual Sentiment Analysis benchmarks. Our results show that the CLIP-E approaches outperform SOTA models on WEBEmo fine-grained categorization, and that they also generalize better when tested on datasets not seen during training. Interestingly, on the FI dataset, CLIP Zero-Shot achieves higher accuracy than both the SOTA models and CLIP-E trained on WEBEmo. These results motivate several questions that we discuss in this paper, such as how we should design new benchmarks and evaluate Visual Sentiment Analysis, and whether we should keep designing tailored Deep Learning models for Visual Sentiment Analysis or instead focus our efforts on making better use of the knowledge encoded in large vision-language models such as CLIP.
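To make the CLIP Zero-Shot baseline concrete, the following is a minimal sketch of zero-shot classification in a shared embedding space: each image embedding is assigned to the class whose text-prompt embedding has the highest cosine similarity. It is an illustration under stated assumptions, not the setup used in this work. The embeddings here are random placeholders standing in for the outputs of a CLIP image and text encoder, the two class names are hypothetical, and the temperature of 100 is an assumed scaling value.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_predict(image_embs, text_embs, temperature=100.0):
    """Assign each image to the class with the most similar text embedding.

    image_embs: (n_images, dim) array of image embeddings.
    text_embs:  (n_classes, dim) array, one prompt embedding per class.
    temperature: assumed logit scale applied before the softmax.
    """
    image_embs = l2_normalize(image_embs)
    text_embs = l2_normalize(text_embs)
    logits = temperature * (image_embs @ text_embs.T)  # (n_images, n_classes)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs


# Placeholder data: in practice these would come from a CLIP encoder.
rng = np.random.default_rng(0)
class_names = ["positive", "negative"]  # hypothetical sentiment classes
text_embs = rng.normal(size=(len(class_names), 512))
# Fake "images": each is its class prompt embedding plus small noise.
image_embs = text_embs + 0.1 * rng.normal(size=text_embs.shape)

preds, probs = zero_shot_predict(image_embs, text_embs)
for i, p in enumerate(preds):
    print(f"image {i}: {class_names[p]}")
```

No parameters are trained in this procedure; a learned model such as CLIP-E would instead fit a classifier on top of the image embeddings using labeled data.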