GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also through their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously proposed one-shot quantization methods while preserving accuracy, allowing us for the first time to execute a 175-billion-parameter model inside a single GPU for generative inference. Moreover, we show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
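To make the single-GPU claim concrete, the following back-of-the-envelope sketch computes the storage needed for the weights alone at several bitwidths. It is an illustration rather than part of GPTQ: the helper name `weight_memory_gb` is hypothetical, and the calculation assumes that only the weights count, ignoring activations, the key-value cache, and any per-group quantization metadata such as scales and zero points.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Hypothetical helper for illustration; it ignores activations, the
    key-value cache, and quantization metadata (scales, zero points).
    """
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 175e9  # parameter count quoted in the abstract

# FP16 baseline (16 bits) versus the quantized bitwidths named in the abstract.
footprints = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (16, 4, 3, 2)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:6.1f} GB")
```

Under these assumptions the FP16 weights take 350 GB, while 4-bit and 3-bit weights take about 87.5 GB and 65.6 GB respectively. Assuming an 80 GB accelerator (the largest A100 memory configuration, a figure not stated in the abstract), only the 3-bit and 2-bit weights would fit on one device, and real deployments would need extra headroom for the items ignored here.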

