Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io

翻译：怪异、不寻常、令人费解的图像因挑战常识而激发观察者的好奇心。例如，2022年世界杯期间流传的一张图片描绘了著名球星莱昂内尔·梅西与克里斯蒂亚诺·罗纳尔多在下国际象棋，这戏谑地违背了我们对他们应在足球场上一决高下的预期。人类能轻松识别并解读这些非常规图像，但AI模型能否做到同样的事情？我们提出了WHOOPS！——一个用于视觉常识的新数据集与基准。该数据集由设计师使用公开可用的图像生成工具（如Midjourney）特意创作的违反常识的图像组成。我们针对数据集设想了多项任务。除图像描述、跨模态匹配与视觉问答外，我们引入了一项困难的解释生成任务，要求模型识别并解释给定图像的异常之处。结果表明，GPT3和BLIP2等最先进模型在WHOOPS！上的表现仍落后于人类。我们希望该数据集能激发具有更强视觉常识推理能力的AI模型的开发。数据、模型与代码可在项目网站whoops-benchmark.github.io获取。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日