Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix incurs considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Specifically, we examine strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques alongside the recently proposed Focus method. We demonstrate that: 1) Embedding-replacement techniques from the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual vocabularies with smaller specialized ones is an efficient way to improve performance in low-resource languages. 3) Simple embedding re-initialization techniques based on script-wise sub-distributions rival techniques such as Focus, which rely on similarity scores obtained from an auxiliary model.
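To make the idea of script-wise re-initialization concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the sub-distributions are defined, so this sketch assumes a per-script, per-dimension Gaussian fitted to the original embedding rows, and it assumes that tokens shared by both vocabularies simply keep their original embeddings. The names `script_of` and `reinit_embeddings` are hypothetical, and taking the first word of a character's Unicode name as its script label is an illustrative heuristic.

```python
import unicodedata

import numpy as np


def script_of(token: str) -> str:
    """Rough script label: first word of the Unicode name of the first letter.

    This is a heuristic assumed for illustration, e.g.
    "LATIN SMALL LETTER A" -> "LATIN", "CYRILLIC SMALL LETTER DE" -> "CYRILLIC".
    """
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"  # punctuation, digits, special tokens


def reinit_embeddings(old_vocab, old_emb, new_vocab, seed=0):
    """Build an embedding matrix for new_vocab from the old matrix.

    old_vocab / new_vocab map token strings to row indices.
    Shared tokens copy their old row (an assumption of this sketch);
    new tokens are sampled from a Gaussian fitted to old rows of the same script.
    """
    rng = np.random.default_rng(seed)

    # Group old embedding rows by script.
    rows_by_script = {}
    for tok, idx in old_vocab.items():
        rows_by_script.setdefault(script_of(tok), []).append(idx)

    # Per-script, per-dimension mean and standard deviation.
    stats = {
        script: (old_emb[rows].mean(axis=0), old_emb[rows].std(axis=0))
        for script, rows in rows_by_script.items()
    }
    # Fallback for scripts that never occur in the old vocabulary.
    global_stats = (old_emb.mean(axis=0), old_emb.std(axis=0))

    new_emb = np.empty((len(new_vocab), old_emb.shape[1]), dtype=old_emb.dtype)
    for tok, new_idx in new_vocab.items():
        if tok in old_vocab:
            new_emb[new_idx] = old_emb[old_vocab[tok]]
        else:
            mean, std = stats.get(script_of(tok), global_stats)
            new_emb[new_idx] = rng.normal(mean, std)
    return new_emb
```

In this sketch, no auxiliary model is needed: the only inputs are the two vocabularies and the original embedding matrix, which is the contrast the abstract draws with similarity-based methods such as Focus.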