We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing the challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embedding matrix, followed by full-model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach yields significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates, and we release a detailed training recipe. To demonstrate the generalizability of this approach, we also adapt Llama 3 8B to Arabic and Llama 2 13B to Hindi.
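One common embedding-initialization technique that ablations of this kind compare is initializing each new token's embedding as the mean of the embeddings of the subword pieces it decomposes into under the original tokenizer. The sketch below illustrates the idea only; the token names, toy dimensions, and `mean_init` helper are hypothetical and not the paper's actual implementation.

```python
import random

random.seed(0)
DIM = 4  # toy embedding dimension for illustration

# Original (base-model) vocabulary with small random toy embeddings.
base_vocab = {tok: [random.gauss(0, 0.02) for _ in range(DIM)]
              for tok in ["▁ki", "tab", "▁al", "lugha"]}

def mean_init(subword_pieces, vocab):
    """Initialize a new token's embedding as the element-wise mean of
    the embeddings of its decomposition under the old tokenizer."""
    rows = [vocab[p] for p in subword_pieces]
    return [sum(col) / len(rows) for col in zip(*rows)]

# A hypothetical new Arabic token whose old-tokenizer decomposition
# is ["▁ki", "tab"]; its embedding is the mean of those two rows.
vec = mean_init(["▁ki", "tab"], base_vocab)
print(len(vec))
```

In the two-stage recipe described above, rows initialized this way would then be the only trainable parameters during the first stage, before full-model continual pre-training begins.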