We introduce Blink, a new benchmark for multimodal large language models (multimodal LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges to current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of only 51.26% and 45.72%, respectively, just 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist computer vision models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
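As a quick sanity check on the reported margins, derived purely from the abstract's own numbers, both model scores are measured against the same chance-level baseline:

$$51.26\% - 13.17\% = 38.09\%, \qquad 45.72\% - 7.63\% = 38.09\%.$$

The implied random-guess accuracy of roughly 38.09% rather than a flat 25% presumably reflects that Blink's multiple-choice questions vary in their number of answer options (an inference from these figures, not a value stated in this section).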