Visually Prompted Benchmarks Are Surprisingly Fragile

A key challenge in evaluating vision-language models (VLMs) is testing their ability to analyze visual content independently of their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where a question about visual content is paired with the coordinates it refers to, and those coordinates are explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly sensitive to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change model rankings on a leaderboard. By evaluating nine commonly used open- and closed-source VLMs on two visually prompted tasks, we demonstrate that details of benchmark setup, including visual marker design and dataset size, significantly influence model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker lets the open-source InternVL3-8B rank alongside or above much larger proprietary models such as Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also change model rankings. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. We open-source VPBench and our analysis framework at: https://lisadunlap.github.io/vpbench/.
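
To make the notion of a ranking change concrete, the following minimal sketch compares model orderings under two marker variants and lists the model pairs whose order flips. It is an illustration only, not the analysis framework released with VPBench: the function names, the variant labels, the model names, and all accuracy values are hypothetical placeholders and are not results from this work.

```python
from itertools import combinations


def rank_models(accuracy):
    """Return model names sorted from highest to lowest accuracy (ties broken by name)."""
    return sorted(accuracy, key=lambda m: (-accuracy[m], m))


def pairwise_flips(acc_a, acc_b):
    """Return model pairs whose strict order differs between two benchmark variants."""
    flips = []
    for m1, m2 in combinations(sorted(acc_a), 2):
        diff_a = acc_a[m1] - acc_a[m2]
        diff_b = acc_b[m1] - acc_b[m2]
        # A negative product means the sign of the gap changed, i.e. the pair swapped order.
        if diff_a * diff_b < 0:
            flips.append((m1, m2))
    return flips


# Hypothetical placeholder accuracies (NOT results from the paper).
toy_results = {
    "red_marker": {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60},
    "blue_marker": {"model_a": 0.62, "model_b": 0.66, "model_c": 0.58},
}

if __name__ == "__main__":
    for variant, accuracy in toy_results.items():
        print(variant, rank_models(accuracy))
    print("flipped pairs:",
          pairwise_flips(toy_results["red_marker"], toy_results["blue_marker"]))
```

Counting flipped pairs across all marker variants gives a simple way to report how stable a leaderboard is, rather than reporting a single ranking from one arbitrary marker choice.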
