Training Deep Neural Networks (DNNs) can be computationally demanding, particularly for large models. Recent work has aimed to mitigate this computational challenge by introducing 8-bit floating-point (FP8) formats for multiplication. However, accumulations are still performed in either half (16-bit) or single (32-bit) precision arithmetic. In this paper, we investigate lowering the accumulator word length while maintaining the same model accuracy. We present a multiply-accumulate (MAC) unit with FP8 multiplier inputs and FP12 accumulation, which leverages an optimized stochastic rounding (SR) implementation to mitigate the swamping errors that commonly arise during low-precision accumulation. We investigate the hardware implications and accuracy impact of varying the number of random bits used for rounding operations. We additionally reduce MAC area and power by proposing a new scheme to support SR in a floating-point MAC and by removing support for subnormal values. Our optimized eager SR unit significantly reduces delay and area compared to a classic lazy SR design. Moreover, compared to MACs utilizing single- or half-precision adders, our design delivers notable savings in all metrics. Furthermore, our approach consistently maintains near-baseline accuracy across a diverse range of computer vision tasks, making it a promising alternative for low-precision DNN training.
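The swamping effect that motivates stochastic rounding can be illustrated in a few lines: when an accumulator's mantissa is narrow, a small addend falls below half an ulp of the running sum and round-to-nearest discards it on every step, while stochastic rounding recovers the contribution in expectation. The sketch below is purely illustrative, not the paper's hardware design; the 7-bit mantissa, the `quantize` helper, and the chosen values are assumptions for demonstration.

```python
import math
import random

def quantize(x, mant_bits, sr=False, rng=None):
    """Quantize x to a float with `mant_bits` explicit mantissa bits,
    using round-to-nearest (default) or stochastic rounding (sr=True)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    ulp = 2.0 ** (e - mant_bits)      # spacing of representable values at this exponent
    q = x / ulp
    if sr:
        lo = math.floor(q)
        frac = q - lo                 # distance above the lower neighbour
        # round up with probability `frac`: unbiased in expectation
        return (lo + (1 if rng.random() < frac else 0)) * ulp
    return round(q) * ulp

rng = random.Random(0)
acc_rn, acc_sr = 256.0, 256.0
for _ in range(10_000):
    # each 0.01 increment is below half an ulp of the accumulator,
    # so round-to-nearest discards it every time ("swamping")
    acc_rn = quantize(acc_rn + 0.01, mant_bits=7)
    acc_sr = quantize(acc_sr + 0.01, mant_bits=7, sr=True, rng=rng)

print(acc_rn)   # stuck at 256.0: every increment is swamped
print(acc_sr)   # near the true sum of 356.0 in expectation
```

A hardware SR unit would, of course, draw its random bits from a cheap pseudo-random source rather than a software generator, and the number of random bits used is exactly the trade-off the paper explores.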