We introduce Integer Scale, a novel post-training quantization scheme for large language models that resolves the inference bottleneck of current fine-grained quantization approaches while maintaining comparable accuracy. Integer Scale is a free lunch: it requires no extra calibration or fine-tuning, which would otherwise incur additional cost, and it can be used plug-and-play with most fine-grained quantization methods. Its integration yields up to a 1.85x end-to-end speedup over the original counterpart at comparable accuracy. Moreover, by orchestrating the proposed Integer Scale with fine-grained quantization, we resolve the quantization difficulty of the Mixtral-8x7B and LLaMA-3 models with negligible performance degradation, achieving end-to-end speedups of 2.13x and 2.31x over their respective FP16 versions.