Language similarities can be caused by genetic relatedness, areal contact, universality, or chance. Colexification, i.e., a type of similarity in which a single lexical form conveys multiple meanings, remains underexplored. In this work, we shed light on the linguistic causes of cross-lingual similarity in colexification and phonology by exploring genealogical stability (persistence) and contact-induced change (diffusibility). We construct large-scale graphs incorporating semantic, genealogical, phonological, and geographical data for 1,966 languages. We then demonstrate the potential of this resource by investigating several established hypotheses from previous work in linguistics and proposing new ones. Our results strongly support one hypothesis previously established in the linguistic literature, while offering evidence that contradicts another. Our large-scale resource opens up further research across disciplines, e.g., in multilingual NLP and comparative linguistics.