Humans routinely draw on visual context to predict upcoming words. The extent to which current vision-language models show comparable behaviour is unclear. Here we compared five state-of-the-art pretrained systems with 600 human participants in a web-based Visual-World Paradigm. For each of 100 six-second movie clips, models and participants received either text only or synchronised video and text, and judged how likely a specified target word was to appear next; human eye movements were tracked throughout. Adding visual context increased model-human alignment in predictability ratings across all architectures (average Δr = 0.18), with no effect of parameter count. When visual context was informative, transformer attention significantly increased alignment. Attention maps from two transformer models corresponded with human gaze, explaining up to 70% of the inter-participant variance when the scene contained informative cues. Notably, cross-modal attention reliably tracked anticipatory human fixations on semantic cues. These results suggest that current transformer-based vision-language models can approximate how humans exploit visual context during language prediction, and that selective attention to informative cues, not sheer model scale, is the principal driver of this alignment.
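As a rough illustration of the change in correlation reported above, the sketch below computes model-human alignment in each condition and takes the difference. It assumes alignment is a Pearson correlation between per-clip human and model predictability ratings; the abstract does not state the exact statistic. The function names and toy ratings are hypothetical and are not taken from the study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def alignment_gain(human_text, model_text, human_video, model_video):
    """Difference in model-human correlation: (video + text) minus (text only).

    Assumption: each argument holds one predictability rating per clip,
    with humans and model rated on the same clips within a condition.
    """
    r_text = pearson_r(human_text, model_text)
    r_video = pearson_r(human_video, model_video)
    return r_video - r_text


# Hypothetical toy ratings for five clips (not data from the study).
human_text = [1, 2, 3, 4, 5]
model_text = [2, 1, 4, 3, 5]
human_video = [1, 2, 3, 4, 5]
model_video = [1, 2, 3, 4, 5]

gain = alignment_gain(human_text, model_text, human_video, model_video)
```

A positive gain means the model's ratings track human ratings more closely when video is available than with text alone.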