Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also through their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously proposed one-shot quantization methods while preserving accuracy, allowing us for the first time to execute a 175-billion-parameter model inside a single GPU for generative inference. Moreover, we show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
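
To make the single-GPU claim concrete, the following is a rough back-of-the-envelope sketch of weight storage at the bitwidths mentioned above. It is an illustration and not part of the GPTQ implementation: the helper `weight_storage_gb` is a hypothetical name, and the calculation assumes that only the weights are counted, ignoring activations, the key-value cache, and any per-group quantization metadata such as scales and zero points.

```python
def weight_storage_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Hypothetical helper for illustration; ignores activations, KV cache,
    and quantization metadata (scales, zero points).
    """
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 175e9  # parameter count quoted in the abstract

# FP16 baseline (16 bits) versus the 4-, 3-, and 2-bit settings from the abstract.
footprints = {bits: weight_storage_gb(NUM_PARAMS, bits) for bits in (16, 4, 3, 2)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:6.1f} GB ({16 / bits:.1f}x smaller than FP16)")
```

Under these assumptions, FP16 weights occupy about 350 GB, while 3-bit weights occupy roughly 66 GB. That gap is what makes single-GPU execution plausible on an accelerator with, for example, 80 GB of memory; the 80 GB figure is an assumption about the hardware configuration and is not stated in the abstract.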