While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, power various image-text applications and score highly on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) how many circles appear in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Claude 3.5 Sonnet performs best at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and with recognizing geometric primitives that overlap or sit close together. Code and data are available at: https://vlmsareblind.github.io
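The circle-overlap task reduces to a one-line geometric test: two circles overlap iff the distance between their centers is less than the sum of their radii. A minimal sketch of how such stimuli could be sampled and labeled (this is an illustration under assumed conventions, not the authors' actual BlindTest generator; the function names and canvas/radius parameters are hypothetical):

```python
import math
import random

def circles_overlap(c1, c2):
    """Circles (x, y, r) overlap iff center distance < sum of radii.
    Tangent circles (distance == r1 + r2) count as non-overlapping."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return math.hypot(x2 - x1, y2 - y1) < r1 + r2

def random_circle_pair(rng, canvas=100.0, radius=10.0):
    """Sample two equal-radius circles fully inside a square canvas
    (hypothetical parameters, for illustration only)."""
    def circle():
        return (rng.uniform(radius, canvas - radius),
                rng.uniform(radius, canvas - radius),
                radius)
    return circle(), circle()

rng = random.Random(0)
pairs = [random_circle_pair(rng) for _ in range(5)]
labels = [circles_overlap(a, b) for a, b in pairs]  # ground-truth yes/no labels
```

Rendering each labeled pair as an image and asking the model "do the two circles overlap?" then gives an automatically gradable binary task.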