Large vision-language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, whose meaning is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame visual figurative language understanding as an explainable visual entailment task, in which the model must predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build V-FLUTE, which contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
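To make the <image, claim, label, explanation> format concrete, the following is a minimal illustrative sketch of how one instance and a simple label-accuracy check might be represented. It is not the dataset's actual schema or the evaluation used in this work: the class name `VFluteInstance`, the field names, the label strings, and the helper `label_accuracy` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Literal, Sequence

# Assumption: the text only says the model predicts whether the image
# entails the claim; these two label strings are illustrative.
Label = Literal["entailment", "contradiction"]


@dataclass(frozen=True)
class VFluteInstance:
    """One hypothetical <image, claim, label, explanation> record."""
    image_path: str   # assumed field: location of the image (premise)
    claim: str        # the textual claim (hypothesis)
    label: Label      # whether the image entails the claim
    explanation: str  # free-text justification of the label


def label_accuracy(gold: Sequence[VFluteInstance],
                   predicted: Sequence[VFluteInstance]) -> float:
    """Fraction of instances whose predicted label matches the gold label.

    This checks labels only; it does not score explanation quality.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g.label == p.label for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy, made-up examples for illustration only (not taken from the dataset).
gold = [
    VFluteInstance("img_001.png", "The meeting was a circus.",
                   "entailment", "The image shows a chaotic room."),
    VFluteInstance("img_002.png", "He is as brave as a lion.",
                   "contradiction", "The image shows a person hiding."),
]
predicted = [
    VFluteInstance("img_001.png", "The meeting was a circus.",
                   "entailment", "The room looks disorderly."),
    VFluteInstance("img_002.png", "He is as brave as a lion.",
                   "entailment", "The person appears confident."),
]
accuracy = label_accuracy(gold, predicted)
```

Label accuracy alone would not capture the explainability requirement of the task, since a model can predict the right label with an unsound explanation; assessing the explanations needs a separate measure.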