Multilingual language models have pushed the state of the art in cross-lingual NLP transfer. Most zero-shot cross-lingual transfer approaches, however, use one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we show empirically, in a case study on Faroese (a low-resource language from a high-resource language family), that by leveraging phylogenetic information and departing from the 'one-size-fits-all' paradigm, one can improve cross-lingual transfer to low-resource languages. Specifically, we leverage the abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve transfer performance to Faroese by exploiting data and models of closely related high-resource languages. Further, we release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), and new language models trained on all Scandinavian languages.