We present a large, multilingual study of how vision constrains linguistic choice, covering four languages and five linguistic properties, such as verb transitivity and the use of numerals. We propose a novel method that leverages existing corpora of images with captions written by native speakers, and apply it to nine corpora comprising 600k images and 3M captions. We study the relation between visual input and linguistic choices by training classifiers to predict, from raw images, the probability that a property is expressed, and find evidence that linguistic properties are constrained by visual context across languages. We complement this investigation with a corpus study that takes numerals as a test case. Specifically, we use existing annotations (the number or type of objects) to investigate the effect of different visual conditions on the use of numeral expressions in captions, and show that similar patterns emerge across languages. Our methods and findings both confirm and extend existing research in the cognitive literature. We additionally discuss possible applications for language generation.