We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to reveal how models of this scale can be pre-trained successfully. Over the course of this effort, we faced numerous unexpected technical and engineering challenges, particularly loss spikes and divergence. In this paper, we describe the training process of GLM-130B, including its design choices, its training strategies for both efficiency and stability, and the engineering effort involved. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage that OPT-175B and BLOOM-176B do not show. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training and with almost no performance loss, making it the first 100B-scale model to do so. More importantly, this allows effective inference on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G) GPUs, the most affordable GPUs required to use 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at \url{https://github.com/THUDM/GLM-130B/}.
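The hardware claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative and not taken from the GLM-130B codebase: the helper name `weight_memory_gb` is hypothetical, "24G" and "11G" are read as 24 and 11 GB per card, and only the weights are counted. Activations, caches, and quantization scales need additional memory, so the headroom shown is an upper bound.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1e9 bytes). Hypothetical helper."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 130e9  # parameter count stated in the abstract

int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit weights
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights, for comparison

# Aggregate GPU memory of the two configurations named in the abstract
# (assumption: "24G" and "11G" mean 24 GB and 11 GB per card).
budgets_gb = {
    "4x RTX 3090 (24G)": 4 * 24,
    "8x RTX 2080 Ti (11G)": 8 * 11,
}

for name, total_gb in budgets_gb.items():
    print(
        f"{name}: {total_gb} GB total, "
        f"INT4 weights fit: {int4_gb <= total_gb}, "
        f"FP16 weights fit: {fp16_gb <= total_gb}"
    )
```

Under these assumptions the INT4 weights take about 65 GB, which fits within both the 96 GB and the 88 GB budgets, whereas 16-bit weights (about 260 GB) fit in neither.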