The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies have been limited to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image-caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding in perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.
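The global and local uniformity measures mentioned above are commonly operationalised as, respectively, the variance of per-token surprisal over an utterance and the mean squared difference between successive surprisals. A minimal sketch of these two metrics, using hypothetical hand-picked token probabilities rather than actual model output (the paper's exact formulations may differ):

```python
import math

def surprisal(probs):
    """Per-token surprisal in bits: s_i = -log2 p(w_i | context)."""
    return [-math.log2(p) for p in probs]

def global_nonuniformity(probs):
    """Variance of surprisal across the utterance (lower = more uniform)."""
    s = surprisal(probs)
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)

def local_nonuniformity(probs):
    """Mean squared difference between successive surprisals (lower = smoother)."""
    s = surprisal(probs)
    return sum((s[i] - s[i - 1]) ** 2 for i in range(1, len(s))) / (len(s) - 1)

# Toy conditional probabilities (illustrative values, not model estimates):
ungrounded = [0.5, 0.05, 0.9, 0.1]   # uneven information flow
grounded   = [0.3, 0.25, 0.35, 0.2]  # smoother distribution after grounding

print(global_nonuniformity(ungrounded) > global_nonuniformity(grounded))  # True
print(local_nonuniformity(ungrounded) > local_nonuniformity(grounded))    # True
```

Under this operationalisation, "grounding smooths the distribution of information" corresponds to both quantities decreasing when surprisal is estimated with the visual context included.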