Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. We propose BlindTest, a suite of 7 visual tasks absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the circles in an Olympic-like logo. Surprisingly, four state-of-the-art VLMs are, on average, only 56.20% accurate on our benchmark, with \newsonnet being the best (73.77% accuracy). On BlindTest, VLMs struggle with tasks that require precise spatial information and counting (from 0 to 10), sometimes giving the impression of a person with myopia who sees fine details as blurry and makes educated guesses. Code is available at: https://vlmsareblind.github.io/
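For illustration, task (a) has a trivial closed-form answer: two circles overlap exactly when the distance between their centers is at most the sum of their radii. A minimal sketch of that check (our own illustrative code, not the benchmark's image-generation pipeline):

```python
import math

def circles_overlap(c1, r1, c2, r2):
    """Return True if two circles overlap (touching counts as overlapping)."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    return math.hypot(dx, dy) <= r1 + r2

# Centers 3 apart, radii sum to 4 -> overlapping
print(circles_overlap((0, 0), 2, (3, 0), 2))   # True
# Centers 10 apart, radii sum to 4 -> disjoint
print(circles_overlap((0, 0), 2, (10, 0), 2))  # False
```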