Language similarities can be caused by genetic relatedness, areal contact, universality, or chance. Colexification, i.e. a type of similarity where a single lexical form is used to convey multiple meanings, is underexplored. In our work, we shed light on the linguistic causes of cross-lingual similarity in colexification and phonology by exploring genealogical stability (persistence) and contact-induced change (diffusibility). We construct large-scale graphs incorporating semantic, genealogical, phonological, and geographical data for 1,966 languages. We then show the potential of this resource by investigating several established hypotheses from previous work in linguistics, while proposing new ones. Our results strongly support one previously established hypothesis in the linguistic literature, while offering evidence that contradicts another. Our large-scale resource opens up further research across disciplines, e.g. in multilingual NLP and comparative linguistics.