Large Vision Language Models (LVLMs) exhibit strong visual understanding and reasoning abilities. However, whether their internal representations reflect human visual cognition remains under-explored. In this paper, we address this question by quantifying LVLM-brain alignment with image-evoked electroencephalogram (EEG) signals, analyzing the effects of model architecture, scale, and image type. Specifically, using ridge regression and representational similarity analysis, we compare visual representations from 32 open-source LVLMs with the corresponding EEG responses. We observe a structured LVLM-brain correspondence. First, intermediate layers (8-16) show peak alignment with EEG activity in the 100-300 ms window, consistent with hierarchical human visual processing. Second, multimodal architectural design contributes 3.4 times more to brain alignment than parameter scaling, and models with stronger downstream visual performance exhibit higher EEG similarity. Third, spatiotemporal alignment patterns match known cortical visual pathways. These results demonstrate that LVLMs learn human-aligned visual representations and establish neural alignment as a biologically grounded benchmark for evaluating and improving LVLMs. These findings may also inform the development of neuro-inspired applications.
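To make the two alignment metrics concrete, here is a minimal sketch (not the authors' released code) of a ridge-regression encoding model and representational similarity analysis (RSA) between one LVLM layer and a windowed EEG response. The inputs `feats` and `eeg`, their shapes, and the cross-validation settings are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the two alignment metrics named in the abstract:
# (1) a ridge-regression encoding model and (2) RSA.
# All names, shapes, and hyperparameters are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

# Hypothetical inputs:
#   feats: (n_images, n_units)    activations from one LVLM layer
#   eeg:   (n_images, n_channels) EEG amplitudes averaged over a time window
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 512))
eeg = rng.standard_normal((200, 64))

# Encoding model: cross-validated ridge regression from layer features
# to EEG channels, scored by the mean per-channel Pearson correlation.
ridge = RidgeCV(alphas=np.logspace(-2, 4, 13))
pred = cross_val_predict(ridge, feats, eeg, cv=5)
encoding_score = np.mean([np.corrcoef(pred[:, c], eeg[:, c])[0, 1]
                          for c in range(eeg.shape[1])])

# RSA: build a representational dissimilarity matrix (RDM) for each space
# (upper-triangle correlation distances), then rank-correlate the two RDMs.
rdm_model = pdist(feats, metric="correlation")
rdm_eeg = pdist(eeg, metric="correlation")
rsa_score, _ = spearmanr(rdm_model, rdm_eeg)

print(f"encoding r = {encoding_score:.3f}, RSA rho = {rsa_score:.3f}")
```

In practice, such scores would be computed per layer and per EEG time window, which is how layer-wise peaks (e.g., layers 8-16 at 100-300 ms) can be identified.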