While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, their performance often lags in lower-resource, non-English languages due to biases in the training data. In this work, we explore strategies for adapting two primarily English LLMs (Llama-2 and Llama-3) to Dutch, a language spoken by 30 million people worldwide yet often underrepresented in LLM development. We collect 104GB of Dutch text ($32$B tokens) from various sources and first apply continued pretraining using low-rank adaptation (LoRA), complemented with Dutch posttraining strategies provided by prior work. For Llama-2, we consider (i) using the tokenizer of the original model, and (ii) training a new, Dutch-specific tokenizer combined with embedding reinitialization. We evaluate our adapted models, ChocoLlama-2, both on standard benchmarks and on a novel Dutch benchmark, ChocoLlama-Bench. Our results demonstrate that LoRA can effectively scale for language adaptation, and that tokenizer modification with careful weight reinitialization can improve performance. Notably, Llama-3 was released during the course of this project and, upon evaluation, demonstrated superior Dutch capabilities compared to our Dutch-adapted versions of Llama-2. We hence apply the same adaptation technique to Llama-3, using its original tokenizer. While our adaptation methods enhanced Llama-2's Dutch capabilities, we found limited gains when applying the same techniques to Llama-3. This suggests that for ever-improving multilingual foundation models, language adaptation techniques may benefit more from focusing on language-specific posttraining than on continued pretraining. We hope this work contributes to the broader understanding of adapting LLMs to lower-resource languages, and to the development of Dutch LLMs in particular.
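The LoRA approach referenced above can be illustrated with a minimal sketch. The idea is that a frozen pretrained weight matrix $W$ is augmented with a trainable low-rank update $\frac{\alpha}{r} BA$, so only a small fraction of parameters is trained during continued pretraining. The dimensions, rank, and scaling below are illustrative assumptions, not the paper's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d is the hidden size, r the LoRA rank, alpha the scaling.
d, r, alpha = 64, 8, 16

W = rng.normal(size=(d, d))                 # frozen pretrained weight (not trained)
A = rng.normal(scale=0.01, size=(r, d))     # trainable low-rank factor A
B = np.zeros((d, r))                        # trainable factor B, zero-initialized

def lora_forward(x):
    # Frozen path plus scaled low-rank update: (W + (alpha/r) * B A) x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Because B starts at zero, the adapted model initially matches the base model.
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initializing `B` is the standard LoRA choice: it guarantees the adapted model starts as an exact copy of the base model, so continued pretraining on Dutch text perturbs the original English capabilities gradually rather than abruptly.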
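For the tokenizer-swap variant, the abstract mentions embedding reinitialization for the new Dutch-specific vocabulary. One common heuristic (shown here as an illustration, not necessarily the authors' exact method) initializes each new token's embedding as the mean of the old-tokenizer subword embeddings it decomposes into; the toy vocabulary and token below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 32  # illustrative embedding dimension
# Toy embedding table for a few old-tokenizer subwords (hypothetical pieces).
old_emb = {tok: rng.normal(size=d) for tok in ["ge", "zel", "lig"]}

def init_new_embedding(old_pieces):
    """Initialize a new-vocabulary token's embedding as the mean of the
    embeddings of the old-tokenizer subwords it decomposes into."""
    return np.mean([old_emb[p] for p in old_pieces], axis=0)

# A new Dutch token "gezellig", which the old tokenizer split into three pieces:
e = init_new_embedding(["ge", "zel", "lig"])
assert e.shape == (d,)
```

Compared with random reinitialization, this keeps new-token embeddings inside the distribution the pretrained model already understands, which is why the abstract stresses that *careful* weight reinitialization matters for performance.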