Vision-language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture; interpreting it requires advanced reasoning to deduce that this atypical representation signifies the beer's lightness. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs' understanding of atypicality in persuasive images. We also evaluate how well VLMs use atypicality to infer an ad's message, and we test their reasoning abilities with semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.