Protoform reconstruction is the task of inferring what morphemes or words looked like in the ancestral language of a set of daughter languages. Meloni et al. (2021) achieved the state of the art on Latin protoform reconstruction with an RNN-based encoder-decoder model with attention. We update their model with the state-of-the-art seq2seq architecture: the Transformer. Our model outperforms theirs on a suite of metrics on two datasets: their Romance data of 8,000 cognates spanning 5 languages and a Chinese dataset (Hou 2004) of 800+ cognates spanning 39 varieties. We also probe our model for potential phylogenetic signal. Our code is publicly available at https://github.com/cmu-llab/acl-2023.