FP8 versus INT8 for efficient deep learning inference

Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

Recently, the idea of using FP8 as a number format for neural network training has been gaining attention in the deep learning community. Given that most training is currently conducted with entire networks in FP32, or sometimes in FP16 with mixed precision, running some parts of a network in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises as to what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8, and sometimes even in INT4 when efficiency calls for it. In this whitepaper, we compare the performance of the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present extensive post-training quantization and quantization-aware training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are between 50% and 180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a review of the research literature, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in place of INT8 for efficient inference. We show that our results are mostly consistent with previous findings, but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8, and conclude with a brief discussion of the most efficient way to deploy on-device and an extensive suite of INT8 results for many models.
