Trend and opinion mining in social media increasingly focuses on novel interactions involving visual media, such as images and short videos, in addition to text. In this work, we tackle the problem of visual sentiment analysis of social media images -- specifically, the prediction of image sentiment polarity. While previous work relied on manually labeled training sets, we propose an automated approach for building sentiment polarity classifiers based on a cross-modal distillation paradigm: starting from scraped multimodal (text + images) data, we train a student model on the visual modality using the outputs of a textual teacher model that analyses the sentiment of the corresponding textual modality. We applied our method to images randomly crawled from Twitter over three months and produced, after automatic cleaning, a weakly labeled dataset of $\sim$1.5 million images. Despite exploiting noisily labeled samples, our training pipeline produces classifiers that generalize well and outperform the current state of the art on five manually labeled benchmarks for image sentiment polarity prediction.
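The cross-modal distillation idea described in the abstract can be illustrated with a minimal sketch. It assumes that the textual teacher emits a probability distribution over three polarity classes (negative, neutral, positive) and that the visual student is trained with a soft-label cross-entropy loss; the abstract specifies neither the class set nor the loss, so both are assumptions, and the function names `softmax`, `distillation_loss`, and `loss_gradient` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs):
    """Cross-entropy between the teacher's soft labels (from text)
    and the student's predicted distribution (from the image)."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)

def loss_gradient(student_logits, teacher_probs):
    """Gradient of the loss w.r.t. the student logits.
    For softmax + cross-entropy this is (student_probs - teacher_probs),
    provided teacher_probs sums to 1."""
    student_probs = softmax(student_logits)
    return [s - t for s, t in zip(student_probs, teacher_probs)]

# Assumed example: the teacher reads the tweet text and outputs
# P(negative), P(neutral), P(positive); the values are made up.
teacher_probs = [0.1, 0.2, 0.7]

# Stand-in for the student's image logits (no real network here).
student_logits = [0.0, 0.0, 0.0]

loss_before = distillation_loss(student_logits, teacher_probs)

# One plain gradient-descent step directly on the logits.
learning_rate = 0.5
grad = loss_gradient(student_logits, teacher_probs)
student_logits = [z - learning_rate * g for z, g in zip(student_logits, grad)]

loss_after = distillation_loss(student_logits, teacher_probs)
print(f"loss before: {loss_before:.4f}, after one step: {loss_after:.4f}")
```

In an actual pipeline the logits would come from an image network and the gradient would be backpropagated through its weights; the sketch only shows how teacher outputs on text can act as training targets for a model that sees the image alone.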