Training monolingual language models for low- and mid-resource languages is challenging because pretraining data is limited and often inadequate. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. This one-to-many token mapping substantially improves the initialization of the embedding table for the target language. We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian. The converted models achieve new state-of-the-art performance on these languages across a range of downstream tasks. By significantly reducing the amount of data and time required to train state-of-the-art models, our model conversion strategy has the potential to benefit many languages worldwide.
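As an illustration of how a one-to-many token mapping could seed a target embedding table, the following is a minimal sketch, not the authors' implementation. It assumes that each target token's embedding is a weighted average of the source embeddings it maps to, and that unmapped tokens are drawn at random from the source table's statistics. The function name `init_target_embeddings`, the `token_map` format, and the fallback rule are all hypothetical.

```python
import numpy as np


def init_target_embeddings(source_emb, token_map, n_target, rng=None):
    """Build a target embedding table from a one-to-many token mapping.

    source_emb: (n_source, dim) array of trained source-language embeddings.
    token_map:  dict mapping target_id -> list of (source_id, weight) pairs.
    n_target:   size of the target vocabulary.

    Assumption: mapped tokens receive the weighted average of their source
    embeddings; unmapped tokens are sampled from a normal distribution with
    the per-dimension mean and standard deviation of the source table.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    dim = source_emb.shape[1]
    mean = source_emb.mean(axis=0)
    std = source_emb.std(axis=0)

    # Random fallback for every row; mapped rows are overwritten below.
    target_emb = rng.normal(mean, std, size=(n_target, dim))

    for target_id, pairs in token_map.items():
        if not pairs:
            continue  # no source tokens found: keep the random fallback
        ids = np.array([source_id for source_id, _ in pairs])
        weights = np.array([weight for _, weight in pairs], dtype=float)
        weights /= weights.sum()  # normalize so the weights sum to 1
        target_emb[target_id] = weights @ source_emb[ids]
    return target_emb


# Toy example: 3 source tokens, 3 target tokens, 2-dimensional embeddings.
source_emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
token_map = {
    0: [(0, 1.0), (1, 1.0)],  # target token 0 maps equally to source 0 and 1
    1: [(2, 3.0)],            # target token 1 maps to source token 2 only
}
target_emb = init_target_embeddings(source_emb, token_map, n_target=3)
```

In this toy example, target token 0 is initialized to the midpoint of source tokens 0 and 1, target token 1 copies source token 2, and target token 2, which has no mapping, keeps its random initialization.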