Languages are grouped into families that share common linguistic traits. Although this approach has been successful in revealing genetic relations between diverse languages, further analysis is needed to quantify their relatedness accurately, especially at less-studied linguistic levels such as syntax. Here, we explore linguistic distances using sequences of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that POS trigrams maximize the capacity to capture syntactic variation while remaining compatible with the amount of available data. Linguistic connections are then established by computing pairwise distances between the POS distributions. Intriguingly, our analysis reveals well-defined clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinship.
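As a rough illustration of the pipeline summarized above, the following sketch builds POS trigram distributions for two toy corpora and compares them. It makes several assumptions that the abstract does not confirm: trigrams are counted within sentences only, the distance is the Jensen-Shannon distance (the abstract does not name the measure actually used), and the function names and toy tag sequences are hypothetical.

```python
from collections import Counter
from math import log2, sqrt


def trigram_distribution(sentences):
    """Return relative frequencies of POS trigrams.

    `sentences` is a list of POS tag lists, one list per sentence.
    Trigrams never span a sentence boundary (an assumption of this sketch).
    """
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sentence contains at least three tags")
    return {trigram: n / total for trigram, n in counts.items()}


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base 2) between two sparse distributions.

    This metric is an illustrative stand-in, not necessarily the one
    used in the paper. The result lies between 0 and 1.
    """
    divergence = 0.0
    for trigram in set(p) | set(q):
        pi = p.get(trigram, 0.0)
        qi = q.get(trigram, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * log2(qi / mi)
    # Guard against tiny negative values caused by rounding.
    return sqrt(max(divergence, 0.0))


# Hypothetical toy corpora using Universal Dependencies UPOS tags.
corpus_a = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "ADJ", "NOUN"],
]
corpus_b = [
    ["DET", "NOUN", "DET", "NOUN", "VERB"],
    ["PRON", "DET", "ADJ", "NOUN", "VERB"],
]

dist_a = trigram_distribution(corpus_a)
dist_b = trigram_distribution(corpus_b)
distance_ab = jensen_shannon_distance(dist_a, dist_b)
print(f"distance between toy corpora: {distance_ab:.3f}")
```

Repeating the comparison for every pair of languages would give a distance matrix that could then be passed to a clustering method. Real input would come from the POS column of the Universal Dependencies treebanks rather than hand-written tag lists.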