In this technical report, we present TeleChat, a collection of large language models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. It includes pretrained language models as well as fine-tuned chat models that are aligned with human preferences. TeleChat is first pretrained on an extensive corpus of diverse English and Chinese texts totaling trillions of tokens. The model is then fine-tuned to align with human preferences, following a detailed methodology that we describe. We evaluate TeleChat on a range of tasks, including language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves performance comparable to that of other open-source models of similar size across a wide range of public benchmarks. To support future research and applications that use LLMs, we publicly release the fine-tuned model checkpoints of TeleChat's 7B and 12B variants, along with code and a portion of our pretraining data.