Multilingual language models (LMs) have become a powerful tool in NLP, especially for non-English languages. Nevertheless, the parameter count of multilingual LMs remains large because the embedding matrix must cover a vocabulary of tokens from many different languages. In contrast, monolingual LMs can be trained in a target language with a language-specific vocabulary only, but training a high-quality LM from scratch requires a large budget and the availability of reliable corpora. In this paper, we propose vocabulary trimming (VT), a method that reduces the vocabulary of a multilingual LM to a target language by deleting irrelevant tokens. In theory, VT can compress any existing multilingual LM to build a monolingual LM in any language that the multilingual LM covers. In our experiments, we show that VT retains the original performance of the multilingual LM while producing a smaller model (in general, around 50% of the original vocabulary size is enough). The evaluation is performed over four NLP tasks (two generative and two classification tasks) across four widely used multilingual LMs in seven languages. Finally, we show that this methodology keeps the best of both the monolingual and multilingual worlds: the resulting models are as small as monolingual models without needing to be specifically retrained, and VT can even limit potentially harmful social biases.
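To make the idea of deleting irrelevant tokens concrete, the following is a minimal sketch of vocabulary trimming on a toy model. It is not the authors' implementation. The function name `trim_vocabulary`, the `min_count` threshold, and the rule "keep a token if it occurs in a target-language corpus" are assumptions made for illustration, and plain Python lists stand in for the embedding matrix of a real LM.

```python
from collections import Counter


def trim_vocabulary(vocab, embeddings, corpus_token_ids,
                    special_tokens=(), min_count=1):
    """Keep only tokens seen in a target-language corpus (hypothetical helper).

    vocab:            dict mapping token string -> row index in `embeddings`
    embeddings:       list of embedding rows, one per token id
    corpus_token_ids: token ids obtained by tokenizing target-language text
    special_tokens:   tokens that are always kept (padding, unknown, ...)
    min_count:        assumed frequency threshold for keeping a token
    """
    counts = Counter(corpus_token_ids)
    id_to_token = {idx: tok for tok, idx in vocab.items()}

    # Ids to keep, in their original order so the relative layout is stable.
    keep_ids = sorted(
        idx for idx, tok in id_to_token.items()
        if tok in special_tokens or counts[idx] >= min_count
    )

    # Re-index the surviving tokens and copy over their embedding rows.
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    new_vocab = {id_to_token[old]: new for old, new in old_to_new.items()}
    new_embeddings = [embeddings[old] for old in keep_ids]
    return new_vocab, new_embeddings, old_to_new


# Toy multilingual vocabulary with a 2-dimensional embedding per token.
vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "bonjour": 3, "monde": 4, "world": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Token ids from a (tiny) French corpus: "bonjour monde bonjour".
french_ids = [3, 4, 3]

new_vocab, new_embeddings, old_to_new = trim_vocabulary(
    vocab, embeddings, french_ids, special_tokens={"<pad>", "<unk>"}
)
print(new_vocab)
print(len(embeddings), "->", len(new_embeddings), "embedding rows")
```

In this sketch the retained rows are copied unchanged, which is why no retraining is involved: only the embedding matrix shrinks, and `old_to_new` records how the old token ids map onto the trimmed ones. In a model with an output projection tied to the vocabulary, the same row selection would also have to be applied there.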