There is an intricate relation between the properties of an image and how humans behave while describing the image. This behavior shows ample variation, as manifested in human signals such as eye movements and when humans start to describe the image. Despite the value of such signals of visuo-linguistic variation, they are virtually disregarded in the training of current pretrained models, which motivates further investigation. Using a corpus of Dutch image descriptions with concurrently collected eye-tracking data, we explore the nature of the variation in visuo-linguistic signals, and find that the signals correlate with each other. Given this result, we hypothesize that the variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation. Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.