Figures of speech such as metaphors, similes, and idioms are integral parts of human communication. They are ubiquitous in many forms of discourse, allowing people to convey complex, abstract ideas and evoke emotion. Because figurative forms are often conveyed through multiple modalities (e.g., both text and images), understanding multimodal figurative language is an important AI challenge, one that weaves together deep vision, language, commonsense, and cultural knowledge. In this work, we develop the Image Recognition of Figurative Language (IRFL) dataset. We combine human annotation with an automatic pipeline we created to generate a multimodal dataset, and we introduce two novel tasks as a benchmark for multimodal figurative language understanding. We experimented with state-of-the-art vision and language models and found that the best model (22%) performed substantially worse than humans (97%). We release our dataset, benchmark, and code in the hope of driving the development of models that can better understand figurative language.