Large language models (LLMs), exemplified by ChatGPT, have recently advanced substantially and show remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g., LLaMA) are pretrained on English-dominant corpora, which limits their performance in non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, totaling over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. We also comprehensively evaluate the model's response quality in terms of accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that performance comparable to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, in terms of both knowledge alignment and response quality. The experimental results across thirteen low-resource languages exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.