This study examines 4-bit quantization methods such as GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement in zero-shot tasks. While prior work has focused merely on zero-shot measurement, we extend the task scope to more generative categories such as code generation and abstractive summarization, where we find that INT4 quantization can significantly underperform. However, simply shifting to higher-precision formats such as FP6 has been particularly challenging, and thus overlooked, because of poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with FP6 quantization, the StarCoder-15B model performs comparably to its FP16 counterpart in code generation, and smaller models such as the 406M model closely match their baselines in summarization. Neither can be achieved by INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 that achieves latency similar to state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising alternative to the current 4-bit quantization methods used in LLMs.
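The abstract does not spell out the 4+2 layout, so the following is only a minimal sketch of one plausible reading: each 6-bit weight code is split into a 4-bit piece and a 2-bit piece, and the two pieces are stored in separate streams so that each stream packs evenly into bytes. The function names `split_4_plus_2` and `merge_4_plus_2` and the bit ordering are assumptions made for illustration, not the authors' implementation.

```python
def split_4_plus_2(codes):
    """Split 6-bit codes into a packed 4-bit stream and a packed 2-bit stream.

    Hypothetical layout: the upper 4 bits of each code go into one stream
    (two per byte), the lower 2 bits into another (four per byte).
    """
    if len(codes) % 4 != 0:
        raise ValueError("pad the input to a multiple of 4 codes")
    if any(not 0 <= c < 64 for c in codes):
        raise ValueError("each code must fit in 6 bits")

    hi = [c >> 2 for c in codes]    # upper 4 bits of each code
    lo = [c & 0b11 for c in codes]  # lower 2 bits of each code

    # Two 4-bit pieces per byte.
    hi_bytes = bytes((hi[i] << 4) | hi[i + 1] for i in range(0, len(hi), 2))
    # Four 2-bit pieces per byte.
    lo_bytes = bytes(
        (lo[i] << 6) | (lo[i + 1] << 4) | (lo[i + 2] << 2) | lo[i + 3]
        for i in range(0, len(lo), 4)
    )
    return hi_bytes, lo_bytes


def merge_4_plus_2(hi_bytes, lo_bytes):
    """Rebuild the original 6-bit codes from the two packed streams."""
    hi = []
    for b in hi_bytes:
        hi += [b >> 4, b & 0x0F]
    lo = []
    for b in lo_bytes:
        lo += [(b >> 6) & 3, (b >> 4) & 3, (b >> 2) & 3, b & 3]
    return [(h << 2) | l for h, l in zip(hi, lo)]
```

Under this assumed layout, four 6-bit codes occupy exactly three bytes (two for the 4-bit stream, one for the 2-bit stream), and both streams stay byte-aligned, which is the kind of property that could let a kernel reuse existing 4-bit load paths. Whether the actual design works this way is not stated in the abstract.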