Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both in isolation and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images have little, if any, impact on human acceptability ratings. However, LLMs display the compression effect observed in previous work on human judgments in document contexts. LLMs of different types predict human acceptability judgments with a high degree of accuracy, but in general their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies across models, with Qwen resembling human patterns and others diverging from them. LLM-generated acceptability predictions are, in general, highly correlated with the models' normalised log probabilities. However, these correlations decrease when visual contexts are present, suggesting a larger gap between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.
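The normalised log probability mentioned above can be sketched as follows. This is a minimal illustration only: it assumes length normalisation (mean per-token log probability, one common choice), and the per-token log probabilities are hypothetical values standing in for scores that would in practice be extracted from an LLM.

```python
def normalized_log_prob(token_logprobs):
    """Length-normalised log probability of a sentence:
    the mean of its per-token log probabilities.
    token_logprobs: per-token log probs from a language model."""
    if not token_logprobs:
        raise ValueError("empty token sequence")
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log probabilities for a short sentence;
# a higher (less negative) score indicates higher model probability.
sentence_scores = [-1.2, -0.5, -2.0, -0.3]
score = normalized_log_prob(sentence_scores)
```

Normalising by length makes scores comparable across sentences of different lengths, which is what allows them to be correlated with per-sentence acceptability ratings.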