Representations from deep neural networks (DNNs) have proven remarkably predictive of neural activity involved in both visual and linguistic processing. Despite these successes, most studies to date concern unimodal DNNs, which encode either visual or textual input but not both. Yet there is growing evidence that human meaning representations integrate linguistic and sensory-motor information. Here we investigate whether the integration of multimodal information performed by current vision-and-language DNN models (VLMs) yields representations that are more aligned with human brain activity than those obtained from language-only or vision-only DNNs. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or an accompanying picture. Our results reveal that VLM representations correlate more strongly than language- and vision-only DNNs with activations in brain areas functionally related to language processing. A comparison between different types of visuo-linguistic architectures shows that recent generative VLMs tend to be less brain-aligned than earlier architectures, despite the latter's lower performance on downstream applications. Moreover, through an additional analysis comparing brain vs. behavioural alignment across multiple VLMs, we show that -- with one remarkable exception -- representations that strongly align with behavioural judgments do not correlate highly with brain responses. This indicates that brain similarity does not go hand in hand with behavioural similarity, and vice versa.
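The kind of model-brain comparison described above is typically quantified with representational similarity analysis (RSA): a dissimilarity matrix is computed over stimulus pairs for both the model's representations and the fMRI responses, and the two matrices are correlated. The sketch below is a minimal, generic illustration of that procedure, not the paper's actual pipeline; the function names and the synthetic data shapes are our own assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def rdm(features: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: 1 minus the Pearson
    correlation between every pair of stimulus vectors (rows)."""
    return 1.0 - np.corrcoef(features)


def brain_alignment(model_features: np.ndarray,
                    brain_features: np.ndarray) -> float:
    """Spearman correlation between the upper triangles of the model
    and brain RDMs -- a standard RSA alignment score (hypothetical
    helper, illustrating the general technique)."""
    m, b = rdm(model_features), rdm(brain_features)
    iu = np.triu_indices_from(m, k=1)  # unique off-diagonal pairs
    rho, _ = spearmanr(m[iu], b[iu])
    return rho


# Synthetic example: 20 concept words, 50-dim model embeddings,
# 50 brain "voxels" (purely illustrative numbers).
rng = np.random.default_rng(0)
model = rng.standard_normal((20, 50))
brain = model + 0.5 * rng.standard_normal((20, 50))  # noisy copy
score = brain_alignment(model, brain)
```

A higher score means the model's representational geometry more closely mirrors that of the recorded brain responses; comparing such scores across VLMs, language-only, and vision-only models is the kind of contrast the abstract reports.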