Large Language Models (LLMs) have shown impressive capabilities in code generation. LLM effectiveness generally increases with size: the higher the number of trainable parameters, the better the model's ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. Previous work by Wei et al. proposed leveraging quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs with up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing the limited impact of this compression on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider: (i) more recent and larger code-related LLMs, with up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing compression to the extreme level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, which reduces the average memory footprint by 70% compared to the original model without any significant decrease in performance. Additionally, when quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
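The memory savings discussed above follow directly from the number of bits stored per parameter. The sketch below is a back-of-the-envelope estimate of weights-only storage for a hypothetical 34B-parameter model (the largest size considered here); it ignores quantization metadata such as per-group scales and zero-points, which is why real-world reductions (e.g., the ~70% average reported at 4-bit) are somewhat smaller than the idealized figure.

```python
# Weights-only memory estimate for an LLM at different quantization levels.
# Simplification: actual quantized checkpoints also carry scales/zero-points,
# so observed savings are slightly below these idealized numbers.

def estimate_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-storage size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

if __name__ == "__main__":
    NUM_PARAMS = 34e9  # hypothetical 34B-parameter code LLM
    fp16_gb = estimate_memory_gb(NUM_PARAMS, 16)
    for bits in (16, 8, 4, 3, 2):
        gb = estimate_memory_gb(NUM_PARAMS, bits)
        saving = 1 - gb / fp16_gb
        print(f"{bits:>2}-bit: {gb:5.1f} GB  ({saving:.0%} smaller than fp16)")
```

Relative to a 16-bit baseline, 4-bit storage is an idealized 75% reduction; the gap to the observed 70% average is the metadata overhead noted above.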