While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, score high on many vision-understanding benchmarks, they still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks, including identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. Claude 3.5 Sonnet performs best at 77.84% accuracy, far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs, including slow-thinking models, consistently struggle with tasks that require precise spatial information when geometric primitives overlap or are close together. Yet VLMs reach near-100% accuracy when much more space is added to separate the shapes and letters. Linear probing experiments show that vision encoders contain sufficient visual information to solve BlindTest and that language models fail to decode this information into correct answers. Code and data are at: https://vlmsareblind.github.io
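To make the linear-probing claim concrete, below is a minimal sketch of such an experiment on a BlindTest-style task (two-circles overlap): render synthetic images, extract features from a frozen vision encoder, and fit a linear classifier on those features. The specific encoder (openai/clip-vit-base-patch32), the probe (logistic regression), and the image-generation details are illustrative assumptions, not the authors' exact protocol.

```python
# Hedged sketch of a linear probe on frozen vision-encoder features for a
# BlindTest-style "do the two circles overlap?" task. Encoder choice, probe,
# and rendering parameters are assumptions for illustration only.
import random
import numpy as np
import torch
from PIL import Image, ImageDraw
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

def draw_two_circles(overlap: bool, size: int = 224, radius: int = 30) -> Image.Image:
    """Render two circles that either overlap or are clearly separated."""
    img = Image.new("RGB", (size, size), "white")
    d = ImageDraw.Draw(img)
    x1, y1 = size // 3, size // 2
    # Center-to-center distance below 2*radius => overlap; above => disjoint.
    dist = (random.uniform(0.5, 1.8) if overlap else random.uniform(2.2, 3.0)) * radius
    x2, y2 = int(x1 + dist), y1
    for (x, y) in [(x1, y1), (x2, y2)]:
        d.ellipse([x - radius, y - radius, x + radius, y + radius],
                  outline="black", width=3)
    return img

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    """Frozen encoder features; no gradients, the encoder is never fine-tuned."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).numpy()

labels = [i % 2 for i in range(200)]               # 1 = overlap, 0 = disjoint
images = [draw_two_circles(bool(y)) for y in labels]
X, y = embed(images), np.array(labels)

# A linear probe that classifies well from frozen features indicates the
# encoder already represents the answer, supporting the abstract's claim.
probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("probe accuracy:", probe.score(X[150:], y[150:]))
```

High probe accuracy under this setup would indicate the spatial signal survives the vision encoder and is lost downstream, which is the reading the abstract gives for VLM failures on BlindTest.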