Using model weights pretrained on a high-resource language as a warm start can reduce the data and compute needed to obtain high-quality language models for other languages, especially low-resource ones. However, if we want to use a new tokenizer specialized for the target language, we cannot directly transfer the source model's embedding matrix. In this paper, we propose FOCUS (Fast Overlapping Token Combinations Using Sparsemax), a novel embedding initialization method that effectively initializes the embedding matrix for a new tokenizer based on information in the source model's embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and empirically show that FOCUS outperforms random initialization and previous work in language modeling and on a range of downstream tasks (NLI, QA, and NER).
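To make the combination step concrete, the following is a minimal sketch of how a single new token could be initialized as a sparsemax-weighted combination of overlapping tokens. It is an illustration under stated assumptions, not the reference implementation: the function names `sparsemax` and `init_new_token`, the use of cosine similarity as the similarity measure, and the toy matrices are all hypothetical choices made for this example.

```python
import numpy as np


def sparsemax(z):
    """Euclidean projection of a 1-D score vector onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so only a few
    overlapping tokens end up contributing to the combination.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum        # true for a prefix of the sorted scores
    k_max = k[support][-1]                     # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max      # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)


def init_new_token(aux_new, aux_overlap, src_overlap):
    """Initialize one new token's embedding (hypothetical helper).

    aux_new:     (d_aux,)   auxiliary static embedding of the new token
    aux_overlap: (n, d_aux) auxiliary embeddings of the overlapping tokens
    src_overlap: (n, d_src) source-model embeddings of the same overlapping tokens
    """
    # Assumption: cosine similarity in the auxiliary space.
    norms = np.linalg.norm(aux_overlap, axis=1) * np.linalg.norm(aux_new)
    sims = aux_overlap @ aux_new / norms
    weights = sparsemax(sims)                  # sparse, non-negative, sums to 1
    return weights @ src_overlap, weights


# Toy data: three overlapping tokens, 2-D auxiliary and 2-D source embeddings.
aux_overlap = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
src_overlap = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
embedding, weights = init_new_token(np.array([1.0, 0.0]), aux_overlap, src_overlap)
print(weights, embedding)
```

In a full pipeline, one would presumably call such a helper once per token that is missing from the source vocabulary and write the result into the corresponding row of the new embedding matrix; how tokens that already overlap are handled is not specified here and is left out of the sketch.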