Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models exhibit non-arbitrary associations between sounds and visual representations (known as sound symbolism), a phenomenon also observed in humans. We developed a specialized dataset of synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
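A zero-shot, non-parametric evaluation of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: it assumes precomputed audio and image embeddings, bouba/kiki-style congruency labels, and uses a one-sided permutation test as the non-parametric statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_shot_match(audio_emb, image_embs):
    """Cosine similarity between one audio embedding and candidate image
    embeddings; higher values indicate a stronger audio-visual match."""
    a = audio_emb / np.linalg.norm(audio_emb)
    ims = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return ims @ a

def permutation_test(scores, labels, n_perm=10000, rng=rng):
    """Non-parametric test: do similarity scores for sound-symbolically
    congruent pairs (label 1) exceed incongruent ones (label 0)?"""
    observed = scores[labels == 1].mean() - scores[labels == 0].mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)  # break any real association
        stat = scores[perm == 1].mean() - scores[perm == 0].mean()
        if stat >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # one-sided p-value
```

A small p-value from `permutation_test` would indicate that the model's zero-shot similarities align with the hypothesized sound-symbolic pairings more strongly than chance, without assuming any parametric distribution over the scores.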