Figures of speech such as metaphors, similes, and idioms make language expressive, evoke emotion, and communicate abstract ideas that might otherwise be difficult to visualize. These figurative forms are often conveyed through multiple modalities, such as text and images, and frequently appear in advertising, news, and social media. Understanding multimodal figurative language is an essential component of human communication and plays a significant role in our daily interactions. While humans understand multimodal figurative language intuitively, it poses a challenging task for machines, requiring the cognitive ability to map between domains, as well as abstraction, commonsense, and deep linguistic and cultural knowledge. In this work, we propose the Image Recognition of Figurative Language dataset to examine vision and language models' understanding of figurative language. We leverage human annotation and an automatic pipeline that we created to generate a multimodal dataset, and we introduce two novel tasks as a benchmark for multimodal figurative language understanding. We experiment with several baseline models and find that all perform substantially worse than humans. We hope our dataset and benchmark will drive the development of models that better understand figurative language.