Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower a wide range of tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhancing multilingual capabilities is to construct instruction data for each language, but doing so for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages, and performed instruction tuning on this dataset to facilitate capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. On multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across more than 20 low-resource languages, demonstrating effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing performance in low-resource languages. The demo, homepage, code, and models of BayLing are publicly available.
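To make the idea of mixing high-resource and cross-lingual instructions concrete, below is a minimal sketch of what the two kinds of training records might look like, assuming a common Alpaca-style instruction/input/output layout. The field names, prompt template, and example sentences are illustrative assumptions, not BayLing's actual data schema or prompts.

```python
# A minimal sketch of the two record types used for capability transfer,
# assuming an Alpaca-style instruction-tuning layout; all field names,
# templates, and example text are illustrative, not BayLing's real schema.
from typing import TypedDict


class InstructionRecord(TypedDict):
    instruction: str  # task description shown to the model
    input: str        # optional task input (may be empty)
    output: str       # target response used as the training label


# High-resource instruction: an ordinary English task that trains
# generative ability in a language the base model already handles well.
high_resource: InstructionRecord = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models concentrate their strongest abilities "
             "in a handful of high-resource languages.",
    "output": "LLM capabilities are concentrated in high-resource languages.",
}

# Cross-lingual instruction: pairs a high-resource sentence with a
# low-resource counterpart, encouraging alignment between the languages
# (the Swahili rendering here is an illustrative example).
cross_lingual: InstructionRecord = {
    "instruction": "Translate the following English sentence into Swahili.",
    "input": "Knowledge should be accessible in every language.",
    "output": "Maarifa yanapaswa kupatikana katika kila lugha.",
}


def to_training_text(record: InstructionRecord) -> str:
    """Flatten a record into the single prompt+response string that a
    causal-LM fine-tuning loop would tokenize (template is illustrative)."""
    prompt = f"### Instruction:\n{record['instruction']}\n"
    if record["input"]:
        prompt += f"### Input:\n{record['input']}\n"
    return prompt + f"### Response:\n{record['output']}"


if __name__ == "__main__":
    for rec in (high_resource, cross_lingual):
        print(to_training_text(rec), end="\n\n")
```

Under this kind of setup, standard supervised fine-tuning on the mixed corpus is what drives the transfer: the high-resource records supply generative skill and knowledge, while the cross-lingual records tie low-resource languages to the high-resource representations.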