Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target-language corpus. We then screen out a subset of the model's original vocabulary that is biased toward representing the source language (e.g., English), and use bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue pre-training the mPLMs on the target-language corpus, starting from these initialized representations. Experimental results show that our proposed method outperforms the baseline, which uses a randomly initialized expanded vocabulary for continued pre-training, on POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method is highly robust to the choice of training corpora, and the models' performance on the source language does not degrade after continued pre-training.
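The dictionary-based initialization step could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the fallback to a Gaussian matching the original embedding statistics, and the word-level (rather than subword-level) dictionary lookup are all assumptions made for clarity.

```python
import numpy as np

def init_expanded_embeddings(src_emb, src_vocab, new_tokens, bilingual_dict, seed=0):
    """Initialize embeddings for newly added target-language tokens.

    A new token that has a bilingual-dictionary translation present in the
    source vocabulary is initialized by copying that source token's embedding;
    the remaining tokens are drawn from a normal distribution matching the
    mean/std of the original embedding matrix (an assumed fallback).
    """
    rng = np.random.default_rng(seed)
    dim = src_emb.shape[1]
    mu, sigma = src_emb.mean(), src_emb.std()

    new_emb = np.empty((len(new_tokens), dim))
    for i, tok in enumerate(new_tokens):
        translation = bilingual_dict.get(tok)
        if translation is not None and translation in src_vocab:
            # Dictionary hit: reuse the source-language representation.
            new_emb[i] = src_emb[src_vocab[translation]]
        else:
            # No translation available: random init with matched statistics.
            new_emb[i] = rng.normal(mu, sigma, dim)

    # Stack the expanded rows under the original embedding matrix.
    return np.vstack([src_emb, new_emb])
```

In practice the returned matrix would replace the model's input (and, for tied weights, output) embedding table before continued pre-training begins.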