Recent advances in deep learning, such as large language models (LLMs) and diffusion models, have created a need for improved quantization methods that meet the computational demands of these architectures while maintaining accuracy. Toward this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, image generation, and image segmentation. We examine three FP8 representations (E5M2, E4M3, and E3M4) to study how different trade-offs between dynamic range and precision affect model accuracy. Based on this study, we developed a quantization workflow that generalizes across network architectures. Our empirical results show that FP8 formats outperform INT8 in several respects, including workload coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available in Intel Neural Compressor: https://github.com/intel/neural-compressor.
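To make the trade-off between dynamic range and precision concrete, the following is a minimal sketch that rounds a Python float to the nearest value representable with a given number of exponent and mantissa bits. It is not the workflow from the paper or the Intel Neural Compressor API. The names `fp8_params` and `fp8_round` are hypothetical, and the sketch assumes a simplified IEEE-like layout (bias of 2^(e-1) - 1, top exponent code reserved, saturation on overflow). Actual FP8 specifications differ in detail; for example, common E4M3 definitions reclaim most of the top exponent code and so reach a larger maximum than this sketch does.

```python
import math

def fp8_params(exp_bits, man_bits):
    """Return (max_value, min_normal, min_subnormal) for a simplified format.

    Assumption: IEEE-like layout with the top exponent code reserved.
    Real FP8 variants may use a different bias or special-value encoding.
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias   # largest usable exponent
    min_exp = 1 - bias                     # smallest normal exponent
    max_value = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    min_normal = 2.0 ** min_exp
    min_subnormal = 2.0 ** (min_exp - man_bits)
    return max_value, min_normal, min_subnormal

def fp8_round(x, exp_bits, man_bits):
    """Round x to the nearest representable value, saturating on overflow."""
    if x == 0.0 or math.isnan(x):
        return x
    max_value, _, _ = fp8_params(exp_bits, man_bits)
    min_exp = 1 - (2 ** (exp_bits - 1) - 1)
    sign = math.copysign(1.0, x)
    mag = min(abs(x), max_value)           # saturate instead of overflowing
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exp = max(math.frexp(mag)[1] - 1, min_exp)   # clamp into the subnormal range
    step = 2.0 ** (exp - man_bits)         # spacing between neighbouring values
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * round(mag / step) * step

if __name__ == "__main__":
    for name, e, m in [("E5M2", 5, 2), ("E4M3", 4, 3), ("E3M4", 3, 4)]:
        max_value, _, min_sub = fp8_params(e, m)
        print(name, "max:", max_value, "min subnormal:", min_sub,
              "0.3 ->", fp8_round(0.3, e, m),
              "1000 ->", fp8_round(1000.0, e, m))
```

Under these assumptions, 0.3 rounds to 0.3125 with two or three mantissa bits but to 0.296875 with four, while 1000 is representable (as 1024) only in the format with five exponent bits and saturates in the other two. This is the kind of range-versus-precision behaviour the three formats trade against each other.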