Sound correspondence patterns form the basis of cognate detection and phonological reconstruction in historical language comparison. Methods for the automatic inference of correspondence patterns from phonetically aligned cognate sets have been proposed, but their application to multilingual wordlists requires extremely well-annotated datasets. Since annotation is tedious and time-consuming, it is desirable to find ways to improve aligned cognate data automatically. Taking inspiration from trimming techniques in evolutionary biology, which improve alignments by excluding problematic sites, we propose a workflow that trims phonetic alignments in comparative linguistics prior to the inference of correspondence patterns. Testing these techniques on a large standardized collection of ten datasets with expert annotations from different language families, we find that the best trimming technique substantially improves the overall consistency of the alignments. The results show a clear increase in the proportion of frequent correspondence patterns and of words exhibiting regular cognate relations.
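To make the idea of trimming concrete, the following is a minimal sketch of one simple possibility: dropping alignment columns (sites) in which the share of gaps exceeds a threshold. The abstract does not specify the trimming criteria, so the function name `trim_alignment`, the `max_gap_ratio` parameter, its default of 0.5, and the toy alignment are all illustrative assumptions, not the method evaluated here.

```python
from typing import List

GAP = "-"  # assumed gap symbol in the aligned cognate sets


def trim_alignment(
    alignment: List[List[str]], max_gap_ratio: float = 0.5
) -> List[List[str]]:
    """Drop columns whose share of gap symbols exceeds max_gap_ratio.

    Hypothetical gap-based criterion; the threshold is an assumption.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    # Keep the indices of columns that are not dominated by gaps.
    keep = [
        col
        for col in range(width)
        if sum(row[col] == GAP for row in alignment) / len(alignment)
        <= max_gap_ratio
    ]
    return [[row[col] for col in keep] for row in alignment]


# Toy alignment of three invented word forms (not real data).
toy = [
    ["h", "a", "n", "d", "-"],
    ["h", "a", "n", "t", "-"],
    ["-", "a", "n", "d", "a"],
]
trimmed = trim_alignment(toy)
```

In this toy case the last column is a gap in two of three rows and is removed, while the first column, with a single gap, is kept. Correspondence patterns would then be inferred from the remaining columns only.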