We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition -- systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) -- but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.
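The evaluation pipeline described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' actual code: `make_grid`, `render_text`, and `cell_metrics` are hypothetical helper names. It generates a binary grid at a target density, renders the text-symbol condition (`.` and `#`), and scores a model's transcription with cell-wise accuracy and F1, treating "filled" as the positive class.

```python
import random

def make_grid(n=15, density=0.25, seed=0):
    """Generate an n x n binary grid with roughly the given fill density."""
    rng = random.Random(seed)
    return [[1 if rng.random() < density else 0 for _ in range(n)] for _ in range(n)]

def render_text(grid):
    """Render the text-symbol condition: '.' for empty, '#' for filled."""
    return "\n".join("".join("#" if c else "." for c in row) for row in grid)

def cell_metrics(truth, pred):
    """Cell-wise accuracy and F1, with 'filled' as the positive class."""
    tp = fp = fn = tn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return acc, f1
```

Under this scoring, a model that under-counts filled cells (many false negatives) and one that over-counts them (many false positives) can both land at moderate accuracy while F1 collapses, which is why the paper reports both metrics.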