In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for the phylogenetic inference of language family trees. To date, however, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as the major data source for phylogenetic reconstruction in linguistics, although a few studies highlight the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer to the gold standard phylogenies than phylogenies reconstructed from sound correspondences, by approximately one third on average as measured by the generalized quartet distance.
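To make the evaluation measure concrete, the following is a minimal sketch of a generalized quartet distance. It assumes one common formulation: the share of four-language subsets (quartets) that are resolved in the gold standard tree but are resolved differently, or left unresolved, in the inferred tree. Quartets that the gold standard leaves unresolved are ignored. The nested-tuple tree encoding and all function names are illustrative assumptions, not the implementation used in the study.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set of `tree`, leaf sets of all internal nodes).

    Assumption: a tree is a nested tuple whose leaves are strings.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_topology(quartet, clade_list):
    """Return the 2|2 split a tree induces on a quartet, or None for a star."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = q & clade
        # A clade holding exactly two of the four leaves separates them
        # from the other two, which resolves the quartet.
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def generalized_quartet_distance(gold, inferred):
    """Share of gold-resolved quartets that the inferred tree does not match."""
    gold_leaves, gold_clades = clades(gold)
    inferred_leaves, inferred_clades = clades(inferred)
    if gold_leaves != inferred_leaves:
        raise ValueError("both trees must contain the same languages")
    resolved = different = 0
    for quartet in combinations(sorted(gold_leaves), 4):
        gold_split = quartet_topology(quartet, gold_clades)
        if gold_split is None:
            continue  # unresolved in the gold standard: not counted
        resolved += 1
        if quartet_topology(quartet, inferred_clades) != gold_split:
            different += 1
    return different / resolved if resolved else 0.0


# Hypothetical five-language example (labels are placeholders).
gold = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(generalized_quartet_distance(gold, inferred))  # 0.4
```

In this toy example, two of the five quartets (A, B, C, D and A, B, C, E) are grouped differently in the two trees, so the distance is 0.4. Ignoring quartets that are unresolved in the gold standard keeps a fully resolved inferred tree from being penalized where the reference classification contains polytomies.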