Despite rapid progress in large language models (LLMs), their performance on the vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued pretraining) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end-task performance. We find that (1) the initial performance of an LLM does not always correlate with its final performance after adaptation: adapting English-centric models can yield better results than adapting multilingual models, despite their worse initial performance on low-resource languages; (2) efficiency can easily be improved with simple vocabulary extension and continued pretraining in most LLMs we study; and (3) the optimal adaptation method (choice of base model, new vocabulary size, training data, and initialization strategy) is highly language-dependent, and the simplest embedding initialization works well across various experimental settings. Together, our work lays the foundations for efficiently building language-specific LLMs by adapting existing LLMs.
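To illustrate the "simplest embedding initialization" referred to above, a minimal sketch follows: when vocabulary extension adds a new token, its embedding is initialized as the mean of the embeddings of the base-tokenizer subwords it merges. The vocabularies, token names, and dimensions here are hypothetical, purely for illustration; they are not from the paper's experiments.

```python
import numpy as np

# Toy base vocabulary and randomly initialized embedding matrix
# (stand-ins for a real LLM's tokenizer and embedding table).
rng = np.random.default_rng(0)
dim = 8
base_vocab = {"_hel": 0, "lo": 1, "_wor": 2, "ld": 3}
base_emb = rng.normal(size=(len(base_vocab), dim))

# New tokens added by vocabulary extension, each mapped to the base
# subwords it replaces (assumed decomposition for this example).
new_tokens = {"_hello": ["_hel", "lo"], "_world": ["_wor", "ld"]}

# Mean-initialize: each new row is the average of its subword embeddings.
new_rows = []
for tok, subwords in new_tokens.items():
    ids = [base_vocab[s] for s in subwords]
    new_rows.append(base_emb[ids].mean(axis=0))

extended_emb = np.vstack([base_emb, np.array(new_rows)])
print(extended_emb.shape)  # embedding table grows from 4 to 6 rows
```

After this initialization, continued pretraining on target-language text refines both the new and existing embeddings; the efficiency gain comes from the new tokens covering frequent target-language strings in fewer pieces.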