GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang

From arXiv; accepted to ICLR 2023

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of this scale can be successfully pre-trained. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly loss spikes and divergence. In this paper, we describe the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, whereas OPT-175B and BLOOM-176B show no such advantage. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training and with almost no performance loss, making it the first 100B-scale model to do so. More importantly, this allows effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs on which 100B-scale models can be used. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
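The GPU configurations quoted in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an illustration under simplifying assumptions that are not from the paper: every one of the 130 billion parameters is stored at the stated precision, "24G" and "11G" are read as decimal gigabytes, and activations, caches, and framework overhead are ignored. The helper names `weight_gb` and `fits` are hypothetical.

```python
# Rough weight-memory estimate for a 130B-parameter model.
# Assumptions (not from the paper): all parameters use the same precision,
# 1 GB = 1e9 bytes, and only weights are counted (no activations or overhead).

PARAMS = 130e9  # parameter count stated in the abstract

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Return the approximate size of the weights in gigabytes."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def fits(precision: str, num_gpus: int, gb_per_gpu: float) -> bool:
    """Check whether the weights alone fit in the combined GPU memory."""
    return weight_gb(precision) <= num_gpus * gb_per_gpu


if __name__ == "__main__":
    setups = {"4 x RTX 3090 (24G)": (4, 24), "8 x RTX 2080 Ti (11G)": (8, 11)}
    for precision in BYTES_PER_PARAM:
        size = weight_gb(precision)
        for name, (count, per_gpu) in setups.items():
            verdict = "fits" if fits(precision, count, per_gpu) else "does not fit"
            print(f"{precision}: {size:.0f} GB of weights, {verdict} on {name}")
```

Under these assumptions the INT4 weights take about 65 GB, which is below both the 96 GB and the 88 GB totals, while FP16 weights (about 260 GB) exceed either setup. Real deployments need extra headroom beyond the weights, so this is a lower bound rather than a precise requirement.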

