Figures of speech such as metaphors, similes, and idioms are integral parts of human communication. They are ubiquitous in many forms of discourse, allowing people to convey complex, abstract ideas and evoke emotion. Because figurative forms are often conveyed through multiple modalities (e.g., both text and images), understanding multimodal figurative language is an important AI challenge, one that requires deep vision, language, commonsense, and cultural knowledge. In this work, we develop the Image Recognition of Figurative Language (IRFL) dataset. We combine human annotation with an automatic pipeline we created to generate a multimodal dataset, and we introduce two novel tasks as a benchmark for multimodal figurative language understanding. We experimented with state-of-the-art vision and language models and found that the best model (22%) performed substantially worse than humans (97%). We release our dataset, benchmark, and code in the hope of driving the development of models that can better understand figurative language.