We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. These represent our most capable models, trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models have been pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small corpus spanning 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process involving supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 on general benchmarks such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) approaches GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 on long-context tasks, and 4) outperforms GPT-4 in Chinese alignment as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks such as accessing online information via web browsing and solving math problems with the Python interpreter. Over the course of this development, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.