Large language models (LLMs), exemplified by ChatGPT, have recently advanced substantially and show remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g., LLaMA) are pretrained on English-dominant corpora, which limits their performance in non-English languages. In this paper, we focus on how to effectively transfer language generation and instruction-following capabilities to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, we comprehensively evaluate the model's response quality in terms of accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that performance comparable to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, in terms of both knowledge alignment and response quality. Furthermore, the experimental outcomes across thirteen low-resource languages exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.