We show differences between a language-and-vision model, CLIP, and two text-only models, FastText and SBERT, in how they encode individuation information. We study the latent representations that CLIP produces for substrates, granular aggregates, and varying numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.