Protoform reconstruction is the task of inferring what morphemes or words looked like in the ancestral languages of a set of daughter languages. Meloni et al. (2021) achieved the state of the art on Latin protoform reconstruction with an RNN-based encoder-decoder model with attention. We update their model with the state-of-the-art seq2seq architecture, the Transformer. Our model outperforms theirs on a suite of metrics across two datasets: their Romance data of 8,000 cognates spanning 5 languages, and a Chinese dataset (Hou 2004) of 800+ cognates spanning 39 varieties. We also probe our model for potential phylogenetic signal. Our code is publicly available at https://github.com/cmu-llab/acl-2023.