Large language models (LLMs) have demonstrated remarkable prowess in language understanding and generation. In advancing from foundation LLMs to instruction-following LLMs, instruction tuning plays a vital role in aligning LLMs with human preferences. However, existing LLMs usually focus on English, leading to inferior performance in non-English languages. Improving performance for non-English languages requires collecting language-specific training data for foundation LLMs and constructing language-specific instructions for instruction tuning, both of which are labor-intensive. To minimize human workload, we propose to transfer the capabilities of language generation and instruction following from English to other languages through an interactive translation task. We have developed BayLing, an instruction-following LLM that uses LLaMA as the foundation LLM and automatically constructs interactive translation instructions for instruction tuning. Extensive assessments demonstrate that BayLing achieves performance comparable to GPT-3.5-turbo despite a considerably smaller size of only 13 billion parameters. Experimental results on translation tasks show that BayLing achieves 95% of the single-turn translation capability of GPT-4 under automatic evaluation and 96% of the interactive translation capability of GPT-3.5-turbo under human evaluation. To estimate performance on general tasks, we created a multi-turn instruction test set called BayLing-80. The experimental results on BayLing-80 indicate that BayLing achieves 89% of the performance of GPT-3.5-turbo. BayLing also demonstrates outstanding performance on knowledge assessments based on the Chinese GaoKao and the English SAT, second only to GPT-3.5-turbo among a multitude of instruction-following LLMs. Demo, homepage, code and models of BayLing are available.