Because their training data is unevenly distributed across languages, large language models (LLMs) are often biased towards English in their language ability. In this paper, we propose to empower pre-trained LLMs on non-English languages by building semantic alignment across languages. We perform instruction-tuning on LLaMA with both translation task data and cross-lingual general task data to obtain cross-lingual models (x-LLaMA). Experimental results on the cross-lingual benchmarks XQuAD and MLQA show that x-LLaMA models outperform the English instruction-tuned counterpart (Alpaca) by 42.50% on average across six non-English languages. Further experiments on the Chinese benchmark C-Eval show that x-LLaMA achieves a significant improvement on Chinese humanities tasks, outperforming Alpaca by 8.2%. We also find that placing non-English text on the target side of the translation data is particularly effective for boosting non-English ability. In addition, we find that semantic alignment within LLMs can be further strengthened as the translation task data scales up, and we present a formulation of the underlying scaling law. Evaluation results on the translation dataset Flores-101 show that x-LLaMA outperforms previous LLaMA-based models in all evaluated directions. Code and data will be available at: https://github.com/OwenNJU/x-LLM.
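As a rough illustration of how translation task data with non-English text on the target side might be cast as instruction-tuning examples, consider the minimal sketch below. It assumes Alpaca-style records with `instruction`, `input`, and `output` fields; the function name `build_translation_examples`, the `LANG_NAMES` table, and the instruction wording are hypothetical and are not taken from the released x-LLM code.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from language codes to names used in the prompt text.
LANG_NAMES: Dict[str, str] = {
    "en": "English",
    "zh": "Chinese",
    "es": "Spanish",
}


def build_translation_examples(
    pairs: List[Tuple[str, str]],
    target_lang: str,
    source_lang: str = "en",
) -> List[Dict[str, str]]:
    """Turn (source, target) sentence pairs into instruction-tuning records.

    The non-English sentence is placed in the "output" field, i.e. on the
    target side, so the model learns to generate non-English text.
    The record layout is an assumed Alpaca-style format.
    """
    src_name = LANG_NAMES[source_lang]
    tgt_name = LANG_NAMES[target_lang]
    instruction = f"Translate the following sentence from {src_name} to {tgt_name}."
    return [
        {"instruction": instruction, "input": src, "output": tgt}
        for src, tgt in pairs
    ]


if __name__ == "__main__":
    parallel = [("Hello, world.", "你好,世界。")]
    for record in build_translation_examples(parallel, target_lang="zh"):
        print(record)
```

Records built this way could then be mixed with cross-lingual general task data before instruction-tuning; the mixing ratio and prompt template are design choices that this sketch leaves open.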