We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find that these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of only 51.26% and 45.72%, just 13.17 and 7.63 percentage points above random guessing, indicating that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also highlights that specialist CV models solve these tasks substantially better, suggesting potential pathways for future improvement. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
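To make the multiple-choice evaluation protocol concrete, below is a minimal sketch of how a Blink-style benchmark can be scored against a per-item chance baseline. The `Question` schema, the `predict` interface, and the file names are illustrative assumptions for this sketch, not the released Blink code or data format.

```python
import random
from dataclasses import dataclass

@dataclass
class Question:
    """One Blink-style item: image(s) plus a multiple-choice prompt.

    This schema is hypothetical; the actual Blink data format may differ.
    """
    image_paths: list[str]  # one or more images, possibly with visual prompts overlaid
    prompt: str             # e.g. "Which marked point is closer to the camera?"
    choices: list[str]      # e.g. ["(A)", "(B)"]
    answer: str             # gold choice label

def evaluate(questions: list[Question], predict) -> float:
    """Accuracy of a model exposed as predict(question) -> choice label."""
    correct = sum(predict(q) == q.answer for q in questions)
    return correct / len(questions)

def random_baseline(questions: list[Question]) -> float:
    """Expected accuracy of uniform random guessing over each item's choices."""
    return sum(1 / len(q.choices) for q in questions) / len(questions)

if __name__ == "__main__":
    qs = [Question(["pair.png"],
                   "Which marked point is closer to the camera?",
                   ["(A)", "(B)"], "(A)")]
    guesser = lambda q: random.choice(q.choices)  # stand-in for a multimodal LLM
    print(f"model accuracy:  {evaluate(qs, guesser):.2%}")
    print(f"chance baseline: {random_baseline(qs):.2%}")
```

Reporting the gap between model accuracy and this per-item chance baseline, rather than raw accuracy alone, is what grounds claims like "only 13.17 percentage points above random guessing" when items vary in their number of choices.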