Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks given just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to help the research community better understand the training dynamics of Baichuan 2.