We introduce Blink, a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find that these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans reach 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing models, GPT-4V and Gemini, achieve accuracies of only 51.26% and 45.72%, respectively, just 13.17 and 7.63 percentage points above random guessing. This indicates that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also shows that specialist CV models can solve these problems much better, suggesting potential pathways for future improvement. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
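As a quick sanity check on the figures above, the following minimal Python sketch recovers the random-guess baseline implied by the reported accuracies and margins, along with each model's gap to human accuracy. The abstract does not state the baseline directly, so the value computed here is only what the quoted numbers imply; the variable and function names are illustrative assumptions.

```python
# Figures quoted in the abstract, in percent accuracy.
HUMAN_ACCURACY = 95.70

# model name -> (reported accuracy, reported margin over random guessing)
REPORTED = {
    "GPT-4V": (51.26, 13.17),
    "Gemini": (45.72, 7.63),
}


def implied_baseline(accuracy: float, margin: float) -> float:
    """Random-guess accuracy implied by an accuracy and its margin (assumed additive)."""
    return round(accuracy - margin, 2)


# Both models should imply the same random-guess baseline.
baselines = {name: implied_baseline(acc, m) for name, (acc, m) in REPORTED.items()}

# Remaining gap to average human accuracy, in percentage points.
gaps_to_human = {name: round(HUMAN_ACCURACY - acc, 2) for name, (acc, _) in REPORTED.items()}

for name in REPORTED:
    print(f"{name}: implied baseline {baselines[name]:.2f}%, "
          f"gap to human {gaps_to_human[name]:.2f} points")
```

Both models imply the same baseline, which is consistent with the margins being absolute differences in percentage points rather than relative improvements.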