Large Language Models (LLMs) have shown impressive capabilities in code generation. LLM effectiveness generally increases with size: the higher the number of trainable parameters, the better the model's ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. Previous work by Wei et al. proposed leveraging quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs with up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing the limited impact of this compression on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider: (i) more recent and larger code-related LLMs, with up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing compression to the extreme level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, which reduces the average memory footprint by 70% compared to the original model without any significant decrease in performance. Additionally, when quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
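The memory savings discussed above follow directly from the number of bits stored per parameter. The sketch below is a back-of-the-envelope estimate of weights-only storage for a hypothetical 34B-parameter model (the largest size considered here); it ignores quantization metadata such as per-group scales and zero-points, which is why real-world reductions (e.g., the ~70% average reported at 4-bit) are somewhat smaller than the idealized figure.

```python
# Weights-only memory estimate for an LLM at different quantization levels.
# Simplification: actual quantized checkpoints also carry scales/zero-points,
# so observed savings are slightly below these idealized numbers.

def estimate_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-storage size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

if __name__ == "__main__":
    NUM_PARAMS = 34e9  # hypothetical 34B-parameter code LLM
    fp16_gb = estimate_memory_gb(NUM_PARAMS, 16)
    for bits in (16, 8, 4, 3, 2):
        gb = estimate_memory_gb(NUM_PARAMS, bits)
        saving = 1 - gb / fp16_gb
        print(f"{bits:>2}-bit: {gb:5.1f} GB  ({saving:.0%} smaller than fp16)")
```

Relative to a 16-bit baseline, 4-bit storage is an idealized 75% reduction; the gap to the observed 70% average is the metadata overhead noted above.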