Evidence of Human-Like Visual-Linguistic Integration in Multimodal Large Language Models During Predictive Language Processing

The advanced language processing abilities of large language models (LLMs) have stimulated debate over their capacity to replicate human-like cognitive processes. One differentiating factor between language processing in LLMs and humans is that, in humans, language input is often grounded in several perceptual modalities, whereas most LLMs process solely text-based information. Multimodal grounding allows humans to integrate, for example, visual context with linguistic information and thereby constrain the space of upcoming words, reducing cognitive load and improving comprehension. Recent multimodal LLMs (mLLMs) combine a visual-linguistic embedding space with a transformer-type attention mechanism for next-word prediction. Here we ask whether predictive language processing based on multimodal input in mLLMs aligns with that of humans. Two hundred participants watched short audio-visual clips and estimated the predictability of an upcoming verb or noun. The same clips were processed by the mLLM CLIP, with predictability scores based on comparing image and text feature vectors. Eye-tracking was used to estimate which visual features participants attended to, and CLIP's visual attention weights were recorded. We find that alignment of predictability scores was driven by the multimodality of CLIP (no alignment for a unimodal state-of-the-art LLM) and by the attention mechanism (no alignment when attention weights were perturbed or when the same input was fed to a multimodal model without attention). We further find a significant spatial overlap between CLIP's visual attention weights and human eye-tracking data. These results suggest that comparable processes of integrating multimodal information, guided by attention to relevant visual features, support predictive language processing in mLLMs and humans.
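To make the idea of "comparing image and text feature vectors" concrete, the following is a minimal sketch of one way such a predictability score could be computed. It is an illustration under stated assumptions, not the procedure used in the study: the function names `cosine_similarity` and `predictability_scores` are hypothetical, the toy vectors stand in for real CLIP image and text features, and the softmax normalization over candidate words (with a `temperature` parameter) is an assumption that the abstract does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predictability_scores(image_features, candidate_text_features, temperature=1.0):
    """Turn image-text similarities into a distribution over candidate words.

    Assumption: a softmax over cosine similarities; the source only says
    that scores are based on comparing image and text feature vectors.
    """
    sims = [
        cosine_similarity(image_features, text) / temperature
        for text in candidate_text_features
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    max_sim = max(sims)
    exps = [math.exp(s - max_sim) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example (made-up vectors, not real CLIP features): the first candidate
# points in nearly the same direction as the image vector, the second is
# orthogonal to it.
image = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0]]
scores = predictability_scores(image, candidates)
```

In this sketch a candidate word whose text features point in a direction similar to the image features receives a higher score, which is the sense in which visual context constrains the space of upcoming words.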
