In this study, we introduce CT-LLM, a 2B-parameter large language model (LLM) that marks a pivotal shift towards prioritizing the Chinese language in LLM development. Trained from scratch, CT-LLM diverges from the conventional methodology by drawing primarily on Chinese textual data: an extensive corpus of 1,200 billion tokens, comprising 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition gives the model exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. CT-LLM achieves remarkable performance on CHC-Bench, excelling at Chinese language tasks, and also demonstrates adeptness in English after supervised fine-tuning (SFT). This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons of LLM training methodology. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure together with the resulting Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-parameter Chinese Tiny LLM (CT-LLM) itself, we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.