This study examines 4-bit quantization methods such as GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and its limited improvement on zero-shot tasks. Whereas prior work has focused only on zero-shot evaluation, we extend the task scope to generative categories such as code generation and abstractive summarization, where we find that INT4 quantization can significantly underperform. Simply moving to a higher-precision format such as FP6, however, has been particularly challenging and therefore overlooked: without sophisticated integration and system acceleration strategies, FP6 performs poorly on current AI hardware. Our results show that FP6, even with a coarse-grained quantization scheme, performs robustly across various algorithms and tasks, demonstrating its advantage in accuracy and versatility. Notably, with FP6 quantization, the StarCoder-15B model performs comparably to its FP16 counterpart in code generation, and smaller models such as the 406M model closely match their baselines in summarization; INT4 achieves neither. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 that achieves latency similar to state-of-the-art INT4 fine-grained quantization. With our design, FP6 can become a promising alternative to the 4-bit quantization methods currently used in LLMs.
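The abstract names a "4+2 design" for FP6 but does not describe its layout. One plausible reading, assumed here purely for illustration, is that each 6-bit weight code is stored as a 4-bit segment plus a 2-bit segment, so that both streams pack evenly into bytes. The sketch below shows only that bit-level split and its inverse in NumPy; the function names `split_4_plus_2` and `merge_4_plus_2` are hypothetical, and this is not the authors' kernel or memory layout.

```python
import numpy as np

def split_4_plus_2(codes):
    """Split 6-bit codes (0..63) into a packed 4-bit stream and a packed 2-bit stream.

    Assumption: the "4+2" split means upper 4 bits and lower 2 bits of each code.
    """
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    if codes.max(initial=0) > 63:
        raise ValueError("codes must fit in 6 bits")
    hi = codes >> 2      # upper 4 bits of each code
    lo = codes & 0b11    # lower 2 bits of each code
    # Two 4-bit segments per byte.
    hi_packed = (hi[0::2] << 4) | hi[1::2]
    # Four 2-bit segments per byte.
    lo_packed = (lo[0::4] << 6) | (lo[1::4] << 4) | (lo[2::4] << 2) | lo[3::4]
    return hi_packed, lo_packed

def merge_4_plus_2(hi_packed, lo_packed):
    """Inverse of split_4_plus_2: rebuild the original 6-bit codes."""
    hi = np.empty(hi_packed.size * 2, dtype=np.uint8)
    hi[0::2] = hi_packed >> 4
    hi[1::2] = hi_packed & 0x0F
    lo = np.empty(lo_packed.size * 4, dtype=np.uint8)
    lo[0::4] = lo_packed >> 6
    lo[1::4] = (lo_packed >> 4) & 0b11
    lo[2::4] = (lo_packed >> 2) & 0b11
    lo[3::4] = lo_packed & 0b11
    return (hi << 2) | lo

# Usage: 64 six-bit codes occupy 32 + 16 = 48 bytes, i.e. 6 bits per weight.
codes = np.arange(64, dtype=np.uint8)
hi_packed, lo_packed = split_4_plus_2(codes)
restored = merge_4_plus_2(hi_packed, lo_packed)
```

Under this assumed layout, the 4-bit stream has the same shape as an INT4 weight buffer, and the 2-bit stream adds half as many bytes again. Whether the proposed system uses this exact ordering is not stated in the abstract.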