Large-scale pretrained language models (LMs) are said to ``lack the ability to connect [their] utterances to the world'' (Bender and Koller, 2020). If so, we would expect LM representations to be unrelated to the representations learned by computer vision models. To investigate this, we present an empirical evaluation across three LMs (BERT, GPT2, and OPT) and three computer vision models (VMs): ResNet, SegFormer, and MAE. Our experiments show that LMs converge towards representations that are partially isomorphic to those of VMs, with dispersion and polysemy both factoring into the alignability of vision and language spaces. We discuss the implications of this finding.