Large language models (LLMs) have demonstrated remarkable prowess in language understanding and generation. In advancing from foundation LLMs to instruction-following LLMs, instruction tuning plays a vital role in aligning LLMs with human preferences. However, existing LLMs usually focus on English, leading to inferior performance in non-English languages. Improving performance in non-English languages requires collecting language-specific training data for foundation LLMs and constructing language-specific instructions for instruction tuning, both of which are labor-intensive. To minimize human workload, we propose to transfer the capabilities of language generation and instruction following from English to other languages through an interactive translation task. We have developed BayLing, an instruction-following LLM that uses LLaMA as the foundation LLM and automatically constructs interactive translation instructions for instruction tuning. Extensive assessments demonstrate that BayLing achieves performance comparable to GPT-3.5-turbo with only 13 billion parameters. Experimental results on translation tasks show that BayLing achieves 95% of the single-turn translation capability of GPT-4 under automatic evaluation and 96% of the interactive translation capability of GPT-3.5-turbo under human evaluation. To estimate performance on general tasks, we created a multi-turn instruction test set called BayLing-80. The experimental results on BayLing-80 indicate that BayLing achieves 89% of the performance of GPT-3.5-turbo. BayLing also demonstrates outstanding performance on knowledge assessments based on the Chinese GaoKao and the English SAT, second only to GPT-3.5-turbo among a multitude of instruction-following LLMs. The demo, homepage, code, and models of BayLing are available.