Vision-Language Models (VLMs) have been shown to be "blind": they often underutilize their visual inputs, even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind: they modulate the amount of attention they apply to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and the distribution of attention over the image. Compared with open-ended framings, constrained framings such as multiple choice and yes/no induce substantially lower attention to image context, reduce focus on task-relevant regions, and shift attention toward uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method that uses learnable tokens to encourage the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
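To make the idea of using visual attention as a probe concrete, the following is a minimal sketch of one way to measure how much attention a model places on image tokens under a given framing. It is an illustration under stated assumptions, not the measurement procedure used in this work: the function name `image_attention_fraction`, the tensor layout `(layers, heads, seq_len, seq_len)`, the choice of the last token as the query, the uniform averaging over layers and heads, and the toy attention values are all hypothetical.

```python
import numpy as np


def image_attention_fraction(attn, image_mask, query_index=-1):
    """Share of attention mass that one query token places on image tokens.

    attn:       array of shape (layers, heads, seq_len, seq_len); each row
                attn[l, h, q, :] is assumed to be post-softmax (sums to 1).
    image_mask: boolean array of shape (seq_len,), True at image-token positions.
    query_index: position of the query token (assumed: the last token).
    """
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    rows = attn[:, :, query_index, :]              # (layers, heads, seq_len)
    per_head = rows[..., image_mask].sum(axis=-1)  # (layers, heads)
    # Assumption: layers and heads are weighted equally.
    return float(per_head.mean())


def toy_attention(row, layers=2, heads=3):
    """Build a toy attention tensor in which every query uses the same row."""
    row = np.asarray(row, dtype=float)
    row = row / row.sum()                          # normalize like a softmax output
    seq_len = row.size
    return np.broadcast_to(row, (layers, heads, seq_len, seq_len)).copy()


# Toy sequence: 4 image tokens followed by 2 text tokens (made-up layout).
image_mask = np.array([True] * 4 + [False] * 2)

# Made-up attention patterns standing in for two framings of the same question.
open_ended = toy_attention([1, 1, 1, 1, 1, 1])    # attention spread evenly
constrained = toy_attention([1, 1, 1, 1, 8, 8])   # attention pulled to text tokens

frac_open = image_attention_fraction(open_ended, image_mask)
frac_constrained = image_attention_fraction(constrained, image_mask)
print(f"open-ended:  {frac_open:.3f}")
print(f"constrained: {frac_constrained:.3f}")
```

In practice the attention tensor would come from the model itself rather than from toy values, and the same fraction could be restricted to a task-relevant region by passing a mask that covers only the image tokens inside that region.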