Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, further analysis is needed to quantify their relatedness accurately, especially at less-studied linguistic levels such as syntax. Here, we explore linguistic distances using sequences of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that POS trigrams maximize the capacity to capture syntactic variation while remaining compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances between languages based on their POS distributions. Intriguingly, our analysis reveals well-defined clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinship.
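The pipeline summarized above (POS sequences, trigram distributions, pairwise distances) can be illustrated with a minimal sketch. The abstract does not name the distance measure, so the Jensen-Shannon distance used here is an assumption chosen only because it is a common information-theoretic choice. The function names, the language labels, and the toy tag sequences are hypothetical; only the tag names follow the Universal Dependencies UPOS inventory.

```python
from collections import Counter
from itertools import combinations
from math import log2, sqrt


def trigram_distribution(sentences):
    """Relative frequencies of POS trigrams over a list of tagged sentences."""
    counts = Counter()
    for tags in sentences:
        # Slide a window of three consecutive tags over each sentence.
        counts.update(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no trigrams found; sentences need at least three tags")
    return {trigram: n / total for trigram, n in counts.items()}


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence (assumed measure)."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution at this trigram
        if pk > 0:
            divergence += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * log2(qk / mk)
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(divergence, 0.0))


def pairwise_distances(corpora):
    """Distance for every unordered pair of languages in `corpora`."""
    dists = {name: trigram_distribution(s) for name, s in corpora.items()}
    return {
        (a, b): jensen_shannon_distance(dists[a], dists[b])
        for a, b in combinations(sorted(dists), 2)
    }


# Hypothetical toy data, NOT taken from the Universal Dependencies treebanks.
toy_corpora = {
    "lang_a": [["DET", "NOUN", "VERB", "DET", "NOUN"], ["PRON", "VERB", "ADJ", "NOUN"]],
    "lang_b": [["DET", "NOUN", "VERB", "DET", "NOUN"], ["PRON", "VERB", "NOUN", "ADJ"]],
    "lang_c": [["NOUN", "DET", "NOUN", "VERB"], ["NOUN", "ADJ", "PRON", "VERB"]],
}

if __name__ == "__main__":
    for pair, distance in pairwise_distances(toy_corpora).items():
        print(pair, round(distance, 3))
```

In this toy setting, `lang_a` and `lang_b` share most of their trigrams and so end up closer to each other than either is to `lang_c`. A matrix of such distances is the kind of input a clustering step would take, although the clustering procedure itself is not specified in the abstract.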