Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset consists of purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models, and code are available at the project website: whoops-benchmark.github.io