Protolanguage reconstruction is central to historical linguistics. The comparative method, one of the most influential theoretical and methodological frameworks in the history of the language sciences, allows linguists to infer protoforms (reconstructed ancestral words) from their reflexes (related modern words) based on the assumption of regular sound change. Not surprisingly, numerous computational linguists have attempted to operationalize comparative reconstruction through various computational models, the most successful of which have been supervised encoder-decoder models, which treat the problem of predicting protoforms given sets of reflexes as a sequence-to-sequence problem. We argue that this framework ignores one of the most important aspects of the comparative method: not only should protoforms be inferable from cognate sets (sets of related reflexes) but the reflexes should also be inferable from the protoforms. Leveraging another line of research -- reflex prediction -- we propose a system in which candidate protoforms from a reconstruction model are reranked by a reflex prediction model. We show that this more complete implementation of the comparative method allows us to surpass state-of-the-art protoform reconstruction methods on three of four Chinese and Romance datasets.
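The reranking idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described above: the abstract does not specify how the two models' scores are combined, so the additive log-score with a `weight` parameter is an assumption, as are the names `Candidate`, `toy_reflex_logprob`, and `rerank`. The reflex scorer here is a string-similarity stand-in for a trained reflex prediction model, and all word forms are invented toy strings rather than real language data.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List


@dataclass
class Candidate:
    protoform: str
    recon_logprob: float  # score assigned by a (hypothetical) reconstruction model


# A reflex scorer maps (protoform, language, attested reflex) to a log-score.
ReflexScorer = Callable[[str, str, str], float]


def toy_reflex_logprob(protoform: str, language: str, reflex: str) -> float:
    """Stand-in scorer (assumption): string similarity instead of a trained
    reflex prediction model. The language argument is ignored here, whereas a
    real model would condition on it."""
    similarity = SequenceMatcher(None, protoform, reflex).ratio()
    return math.log(similarity + 1e-6)  # small constant avoids log(0)


def rerank(
    candidates: List[Candidate],
    reflexes: Dict[str, str],
    scorer: ReflexScorer,
    weight: float = 1.0,
) -> List[Candidate]:
    """Sort candidates by reconstruction score plus weighted reflex score.

    The additive combination is an assumed scoring rule for illustration.
    """
    def combined(candidate: Candidate) -> float:
        reflex_score = sum(
            scorer(candidate.protoform, language, form)
            for language, form in reflexes.items()
        )
        return candidate.recon_logprob + weight * reflex_score

    return sorted(candidates, key=combined, reverse=True)


# Toy cognate set and beam of candidates (invented strings).
reflexes = {"lang_a": "pata", "lang_b": "bata"}
candidates = [
    Candidate("kiku", recon_logprob=-0.5),  # preferred by the reconstruction model alone
    Candidate("pata", recon_logprob=-1.0),  # better explains the attested reflexes
]

ranked = rerank(candidates, reflexes, toy_reflex_logprob)
print([c.protoform for c in ranked])
```

In this toy case the candidate that the reconstruction model ranks second moves to the top because the reflexes are far easier to derive from it; setting `weight=0.0` recovers the original reconstruction-only ordering.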