Attention Alignment Between Humans and Vision-Language Models

Visual perception depends on both top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, which lets us treat each component as a separable hypothesis about what drives where people look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). Four of the models formed a 2$\times$2 factorial design crossing CNN vs.\ ViT encoders with LSTM vs.\ Transformer decoders; the other two were Molmo 7B-D and Qwen3.5 9B. Both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM decoders were 40--50 percentage points more aligned than Transformer decoders (80--87\% vs.\ 40--59\% of the human noise ceiling). Encoder choice had a secondary effect: CNN encoders held a 5--20 point advantage over ViT encoders, depending on decoder family, and CNN-LSTM was the most aligned model overall (85--87\%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated, whereas ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and the strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention affected LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociated: CNN-Transformer attention maps predicted synthetic brain activity better despite lower fixation alignment, and attention maps predicted early visual cortex best. Together, these results suggest that top-down and bottom-up components trade off in what they predict across behavioral and synthetic neural data.
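To make "percent of the human noise ceiling" concrete, the following is a minimal sketch of one way such a score could be computed. It rests on assumptions the abstract does not state: that alignment is measured as a Pearson correlation between a model attention map and the mean human fixation heatmap, and that the noise ceiling is estimated by split-half correlation across observers. The function names (`pearson`, `split_half_ceiling`, `percent_of_ceiling`) are hypothetical and are not taken from the study.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two maps, flattened to vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    # A constant map has zero variance; report 0.0 rather than dividing by zero.
    return float(a @ b / denom) if denom > 0 else 0.0


def split_half_ceiling(subject_maps, n_splits=100, seed=0):
    """Assumed noise-ceiling estimate: mean correlation between the average
    heatmaps of two random halves of the observers.

    subject_maps has shape (n_subjects, H, W).
    """
    rng = np.random.default_rng(seed)
    maps = np.asarray(subject_maps, dtype=float)
    n = maps.shape[0]
    scores = []
    for _ in range(n_splits):
        order = rng.permutation(n)
        half_a = maps[order[: n // 2]].mean(axis=0)
        half_b = maps[order[n // 2:]].mean(axis=0)
        scores.append(pearson(half_a, half_b))
    return float(np.mean(scores))


def percent_of_ceiling(model_map, subject_maps, n_splits=100, seed=0):
    """Model-to-human correlation expressed as a percentage of the ceiling."""
    human_map = np.asarray(subject_maps, dtype=float).mean(axis=0)
    ceiling = split_half_ceiling(subject_maps, n_splits=n_splits, seed=seed)
    return 100.0 * pearson(model_map, human_map) / ceiling


if __name__ == "__main__":
    # Synthetic data for illustration only: a shared signal plus observer noise.
    rng = np.random.default_rng(1)
    signal = rng.random((8, 8))
    subjects = signal + 0.1 * rng.standard_normal((10, 8, 8))
    print(percent_of_ceiling(signal, subjects))
```

Under these assumptions a score can exceed 100\% when a model map matches the group-mean heatmap better than one half of the observers matches the other, and the ratio is undefined when the ceiling is zero or negative; a real analysis would need to handle both cases and may use a different similarity metric entirely.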
