Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters. The base model of YuLan is pre-trained on approximately $1.7$T tokens drawn from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent training phases incorporate instruction tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was completed in January 2024, and the model achieves performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and code are available at https://github.com/RUC-GSAI/YuLan-Chat.