Existing emotion prediction benchmarks use coarse emotion labels that do not capture the diversity of emotions an image and its text can elicit in humans for various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactions benchmark in which each image-caption (IC) pair is annotated with multiple emotions and the reasons for feeling them. Socratis contains 18K free-form reactions for 980 emotions on 2075 IC pairs from 5 widely-read news and image-caption datasets. We benchmark the ability of state-of-the-art multimodal large language models to generate the reasons for feeling an emotion given an IC pair. In a preliminary human study, we observe that humans prefer human-written reasons more than twice as often as machine-generated ones. This suggests that our task is harder than standard generation tasks, in stark contrast to recent findings that, for instance, humans cannot tell machine-written news articles apart from human-written ones. We further find that current captioning metrics based on large vision-language models also fail to correlate with human preferences. We hope that these findings and our benchmark will inspire further research on training emotionally aware models.