We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. These represent our most capable models, trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models have been pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small corpus covering 24 other languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process involving supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4: 1) closely rivals or outperforms GPT-4 on general benchmarks such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval; 2) approaches GPT-4-Turbo in instruction following as measured by IFEval; 3) matches GPT-4 Turbo (128K) and Claude 3 on long-context tasks; and 4) outperforms GPT-4 in Chinese alignment as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including a web browser, a Python interpreter, a text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches or even surpasses GPT-4 All Tools in tasks such as accessing online information via web browsing and solving math problems with the Python interpreter. Along the way, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.