Although the mapping between sound and meaning in human language is generally assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well demonstrated with regard to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do exhibit this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.