Multilingual language models (LMs) have become a powerful tool in NLP, especially for non-English languages. Nevertheless, the parameter count of multilingual LMs remains large because of the large embedding matrix required by a vocabulary that covers tokens in many languages. In contrast, monolingual LMs can be trained in a target language with a language-specific vocabulary only, but building a high-quality LM from scratch requires a large budget and reliable corpora. In this paper, we propose vocabulary-trimming (VT), a method that reduces a multilingual LM vocabulary to a target language by deleting irrelevant tokens from its vocabulary. In theory, VT can compress any existing multilingual LM to build a monolingual LM in any language covered by the multilingual LM. In our experiments, we show that VT retains the original performance of the multilingual LM while producing a smaller model (in general, around 50% of the original vocabulary size is enough). The evaluation covers four NLP tasks (two generative and two classification tasks) across four widely used multilingual LMs in seven languages. Finally, we show that this methodology keeps the best of both the monolingual and multilingual worlds: the resulting models are as small as monolingual models without needing to be retrained specifically, and VT can even limit potentially harmful social biases.
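The trimming idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how "irrelevant" tokens are identified, so this example assumes a token is kept if it appears at least `min_count` times in a tokenized target-language corpus, or if it is a special token. The function name `trim_vocabulary`, its parameters, and the toy data are all hypothetical, and the embedding matrix is modelled as a plain list of rows rather than a real model tensor.

```python
from collections import Counter


def trim_vocabulary(vocab, embeddings, corpus_tokens, special_tokens=(), min_count=1):
    """Keep only tokens seen in the target-language corpus (plus special tokens).

    vocab:         dict mapping token -> row index in `embeddings`
    embeddings:    list of rows, one embedding vector per token id
    corpus_tokens: iterable of tokens from an already tokenized target corpus
    Returns (new_vocab, new_embeddings) with ids renumbered from 0.
    """
    counts = Counter(corpus_tokens)
    special = set(special_tokens)

    # Walk the vocabulary in original id order so relative order is preserved.
    kept = [
        tok
        for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
        if tok in special or counts[tok] >= min_count
    ]

    # Renumber the surviving tokens and copy over their embedding rows.
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [list(embeddings[vocab[tok]]) for tok in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary with 7 tokens and 4-dimensional embeddings.
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "猫": 4, "chat": 5, "le": 6}
embeddings = [[float(4 * i + j) for j in range(4)] for i in range(len(vocab))]

# Assumed target-language (French) corpus, already tokenized.
corpus_tokens = ["le", "chat", "le"]

new_vocab, new_embeddings = trim_vocabulary(
    vocab, embeddings, corpus_tokens, special_tokens=("<pad>", "<unk>")
)

# Embedding parameters before and after trimming.
params_before = len(embeddings) * len(embeddings[0])
params_after = len(new_embeddings) * len(new_embeddings[0])
```

In this toy case the embedding table shrinks from 7 rows to 4, and each surviving token keeps its original vector under a new id. With a real model, the same row selection would be applied to the input embedding matrix (and to any tied output layer), and the tokenizer would need to be rebuilt to match the new ids; those steps are outside this sketch.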