Training deep neural networks (DNNs) is computationally demanding, particularly for large models. Recent work has sought to reduce this cost by introducing 8-bit floating-point (FP8) formats for multiplication. However, accumulation is still performed in either half-precision (16-bit) or single-precision (32-bit) arithmetic. In this paper, we investigate reducing the accumulator word length while maintaining model accuracy. We present a multiply-accumulate (MAC) unit with FP8 multiplier inputs and FP12 accumulation, which uses an optimized stochastic rounding (SR) implementation to mitigate the swamping errors that commonly arise in low-precision accumulation. We investigate how the number of random bits used for rounding affects both hardware cost and accuracy. We additionally seek to reduce MAC area and power by proposing a new scheme to support SR in floating-point MACs and by removing support for subnormal values. Our optimized eager SR unit significantly reduces delay and area compared to a classic lazy SR design. Moreover, compared to MACs that use single- or half-precision adders, our design achieves notable savings in all metrics. Furthermore, our approach consistently maintains near-baseline accuracy across a diverse range of computer vision tasks, making it a promising alternative for low-precision DNN training.
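To make the swamping effect and the role of the random bits concrete, the following is a minimal numerical sketch, not the hardware design described above. It assumes a hypothetical accumulator format with a 6-bit fraction and unbounded exponent range (so no overflow or subnormal handling), and the names `round_to_precision` and `accumulate` are illustrative only. It compares round-to-nearest with SR driven by a fixed number of random bits, and it does not model the eager or lazy SR datapaths.

```python
import math
import random


def round_to_precision(x, mant_bits, rng=None, rand_bits=0):
    """Round x to a significand with `mant_bits` fraction bits.

    Assumption: exponent range is unbounded (no overflow, no subnormals).
    With rng=None, round to nearest (ties to even); otherwise apply
    stochastic rounding using `rand_bits` random bits.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (mant_bits + 1)     # leading bit plus mant_bits fraction bits
    scaled = m * scale
    lo = math.floor(scaled)
    if rng is None:
        q = round(scaled)              # Python's round() is ties-to-even
    else:
        # Keep only `rand_bits` bits of the discarded fraction, add a uniform
        # random integer of the same width, and round up on carry-out.
        frac_q = math.floor((scaled - lo) * 2 ** rand_bits)
        carry = frac_q + rng.getrandbits(rand_bits) >= 2 ** rand_bits
        q = lo + (1 if carry else 0)
    return math.ldexp(q / scale, e)


def accumulate(values, mant_bits, rng=None, rand_bits=0):
    """Sum `values`, rounding the running sum after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_precision(acc + v, mant_bits, rng, rand_bits)
    return acc


if __name__ == "__main__":
    vals = [0.01] * 4096               # exact sum is about 40.96
    print("round-to-nearest:", accumulate(vals, mant_bits=6))
    print("stochastic (8 random bits):",
          accumulate(vals, mant_bits=6, rng=random.Random(0), rand_bits=8))
```

In this toy setting, round-to-nearest stalls at 2.0, because from that point each addend is smaller than half the spacing between representable values and is rounded away every time. SR instead rounds up with a probability proportional to the discarded fraction, so the running sum keeps growing toward the true total. Using fewer random bits truncates that probability more coarsely, which is one way to see why the number of random bits can affect accuracy.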