We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that modeling phoneme sequences as second-order Markov chains is essentially sufficient to capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages, drawn from a multilingual parallel corpus, using a distance metric that incorporates the articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, phonological distance correlates clearly with geographic distance, which allows us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.
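To make the second-order Markov claim concrete, the following is a minimal sketch, not the authors' implementation, of how one might check how much uncertainty about the next phoneme remains after conditioning on the previous zero, one, or two phonemes. The function name `conditional_entropy`, the plug-in (maximum-likelihood) estimator, and the toy phoneme string are all assumptions made for illustration; the abstract specifies neither an estimator nor a data format.

```python
from collections import Counter
from math import log2


def conditional_entropy(seq, order):
    """Plug-in estimate of H(X_t | previous `order` symbols), in bits.

    Hypothetical helper for illustration; `seq` is any sequence of
    hashable phoneme symbols and `order` is the Markov order (0, 1, 2, ...).
    """
    if len(seq) <= order:
        raise ValueError("sequence too short for this order")
    ctx_counts = Counter()    # how often each context of length `order` occurs
    joint_counts = Counter()  # how often each (context, next symbol) pair occurs
    for i in range(order, len(seq)):
        ctx = tuple(seq[i - order:i])
        ctx_counts[ctx] += 1
        joint_counts[(ctx, seq[i])] += 1
    total = sum(joint_counts.values())
    h = 0.0
    for (ctx, _sym), n in joint_counts.items():
        # p(ctx, sym) * log2 p(sym | ctx), summed with a minus sign
        h -= (n / total) * log2(n / ctx_counts[ctx])
    return h


# Toy, made-up phoneme string (an assumption, not corpus data).
phonemes = list("patakapatakapataka")
for k in (0, 1, 2):
    print(f"order {k}: {conditional_entropy(phonemes, k):.3f} bits")
```

If the entropy barely decreases beyond order 2 on real transcriptions, that would be in line with the abstract's claim that short-range dependencies carry most of the structure. On short sequences the plug-in estimate is biased downward at higher orders, so a real analysis would need a large corpus or a bias-corrected estimator.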