Recent advances in deep learning, such as large language models (LLMs) and diffusion models, have created a need for improved quantization methods that can meet the computational demands of these architectures while maintaining accuracy. Toward this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, image generation, and image segmentation. We examine three FP8 representations (E5M2, E4M3, and E3M4) to study how different trade-offs between dynamic range and precision affect model accuracy. Based on this study, we developed a quantization workflow that generalizes across network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available in Intel Neural Compressor: https://github.com/intel/neural-compressor.
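The trade-off between dynamic range and precision across E5M2, E4M3, and E3M4 can be illustrated with a small simulation. The sketch below is not the workflow from this study or the Intel Neural Compressor API. It assumes a simplified IEEE-style encoding (bias 2^(e-1) - 1, top exponent code reserved for special values, round-to-nearest-even, saturation on overflow), and the helper names `fp8_params` and `quantize` are hypothetical. FP8 specifications that reclaim the reserved exponent codes reach a larger maximum value than this model reports.

```python
import math

def fp8_params(exp_bits, man_bits):
    """Return (max_value, min_exponent) for a simplified IEEE-style format.

    Assumption: the top exponent code is reserved for inf/NaN, so the
    largest finite exponent code is 2**exp_bits - 2.
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias
    min_exp = 1 - bias  # exponent shared by the smallest normals and subnormals
    max_value = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    return max_value, min_exp

def quantize(x, exp_bits, man_bits):
    """Round x to the nearest value representable in the simulated format."""
    if x == 0.0:
        return 0.0
    max_value, min_exp = fp8_params(exp_bits, man_bits)
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_value)  # saturate instead of overflowing
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) = e - 1.
    _, e = math.frexp(a)
    exponent = max(e - 1, min_exp)  # clamp into the subnormal range
    step = 2.0 ** (exponent - man_bits)  # spacing between neighbouring values
    q = round(a / step) * step  # Python's round() is round-half-to-even
    return sign * min(q, max_value)

if __name__ == "__main__":
    formats = {"E5M2": (5, 2), "E4M3": (4, 3), "E3M4": (3, 4)}
    for name, (e_bits, m_bits) in formats.items():
        max_value, _ = fp8_params(e_bits, m_bits)
        q = quantize(1.1, e_bits, m_bits)
        print(f"{name}: max={max_value:g}  quantize(1.1)={q:g}  error={abs(q - 1.1):.4f}")
```

Under these assumptions, more exponent bits widen the representable range while more mantissa bits shrink the rounding step near a given value, which is the trade-off the three formats span.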