Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodal nature of concept processing in the human brain. However, a systematic evaluation across different VLM architectures, and of the role played by visual and textual context, is still lacking. Here, we analyse multiple VLMs that employ different strategies to integrate the visual and textual modalities, along with their language-only counterparts. We measure the alignment between the models' concept representations and existing fMRI responses to concept words presented in two experimental conditions, in which either visual context (pictures) or textual context (sentences) is provided. Our results reveal that VLMs outperform their language-only counterparts in both conditions. However, controlled ablation studies show that only for some VLMs, such as LXMERT and IDEFICS2, does brain alignment stem from genuinely learning more human-like concepts during pretraining; other models are highly sensitive to the context provided at inference time. Additionally, we find that vision-language encoders are more brain-aligned than more recent, generative VLMs. Altogether, our study shows that VLMs align with human neural representations during concept processing, while highlighting differences among architectures. We open-source the code and materials needed to reproduce our experiments at: https://github.com/dmg-illc/vl-concept-processing.
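The abstract does not specify how alignment between model representations and fMRI responses is computed. A common approach in this literature is representational similarity analysis (RSA): build a representational dissimilarity matrix (RDM) over the same set of concepts for each system, then correlate the two RDMs. The sketch below illustrates this idea with random placeholder data; the variable names, dimensions, and distance metric are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Hypothetical data: one row per concept word, shared across both systems.
# (Random placeholders; real inputs would be model embeddings and voxel patterns.)
rng = np.random.default_rng(0)
n_concepts = 20
model_embeddings = rng.standard_normal((n_concepts, 64))   # e.g. VLM hidden states
fmri_patterns = rng.standard_normal((n_concepts, 500))     # e.g. voxel responses

def rdm(features: np.ndarray) -> np.ndarray:
    """Condensed RDM: pairwise correlation distance between concept rows."""
    return pdist(features, metric="correlation")

def brain_alignment(model_feats: np.ndarray, brain_feats: np.ndarray) -> float:
    """Spearman correlation between the two condensed RDMs."""
    rho, _ = spearmanr(rdm(model_feats), rdm(brain_feats))
    return float(rho)

score = brain_alignment(model_embeddings, fmri_patterns)
print(f"RSA brain-alignment score: {score:.3f}")
```

With random data the score hovers near zero; a more brain-aligned model would yield a higher Spearman correlation between the two dissimilarity structures.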