Regular sound correspondences constitute the principal evidence in historical language comparison. Although regularity is central to the comparative heuristic, it is often assessed by intuitive judgement rather than by quantified evaluation, and irregularity is more common than the Neogrammarian model would predict. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, one using simulated data and one using real data. In both experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85\% on the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of the new regularity measure and of the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.