Evidence of Human-Like Visual-Linguistic Integration in Multimodal Large Language Models During Predictive Language Processing

The advanced language processing abilities of large language models (LLMs) have stimulated debate over their capacity to replicate human-like cognitive processes. One differentiating factor between language processing in LLMs and humans is that human language input is often grounded in more than one perceptual modality, whereas most LLMs process solely text-based information. Multimodal grounding allows humans to integrate, for example, visual context with linguistic information and thereby constrain the space of upcoming words, reducing cognitive load and improving perception and comprehension. Recent multimodal LLMs (mLLMs) combine visual and linguistic embedding spaces with a transformer-type attention mechanism for next-word prediction. To what extent does predictive language processing based on multimodal input align in mLLMs and humans? To answer this question, 200 human participants watched short audio-visual clips and estimated the predictability of an upcoming verb or noun. The same clips were processed by the mLLM CLIP, with predictability scores based on a comparison of image and text feature vectors. Eye-tracking was used to estimate which visual features participants attended to, and CLIP's visual attention weights were recorded. We find that human estimates of predictability align significantly with CLIP scores, but not with scores from a unimodal LLM of comparable parameter size. Further, alignment vanished when CLIP's visual attention weights were perturbed, and when the same input was fed to a multimodal model without attention. Analysing attention patterns, we find a significant spatial overlap between CLIP's visual attention weights and human eye-tracking data. Results suggest that comparable processes of integrating multimodal information, guided by attention to relevant visual features, support predictive language processing in mLLMs and humans.
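
To make the idea of a predictability score "based on a comparison of image and text feature vectors" concrete, the following is a minimal sketch and not the authors' pipeline. It assumes the comparison is a cosine similarity between one image vector and one text vector per candidate word, and that the similarities are normalised with a softmax; the abstract specifies neither choice. The function names, the `temperature` parameter, and the three-dimensional toy vectors are all hypothetical stand-ins for real CLIP features.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predictability_scores(image_vec, text_vecs, temperature=1.0):
    """Softmax over image-text cosine similarities, one score per candidate word.

    Assumption: the softmax normalisation and `temperature` are illustrative
    choices, not details taken from the abstract.
    """
    sims = {
        word: cosine_similarity(image_vec, vec) / temperature
        for word, vec in text_vecs.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    max_sim = max(sims.values())
    exps = {word: math.exp(s - max_sim) for word, s in sims.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


# Hypothetical toy vectors standing in for image and text features.
image_vec = [0.9, 0.1, 0.3]
text_vecs = {
    "kick": [0.8, 0.2, 0.4],  # points in roughly the same direction as the image
    "read": [0.1, 0.9, 0.2],  # points in a different direction
}

scores = predictability_scores(image_vec, text_vecs)
```

Under these assumptions, the candidate word whose text vector lies closest in direction to the image vector receives the highest score, and the scores over all candidates sum to one.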
