Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs in a strictly text-only setting, which isolates the effect of multimodal training history from that of online visual input or cross-modal fusion. We evaluate model alignment against a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our results suggest that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, a VLM advantage can emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignment. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment in language processing, and they suggest that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.