When gradient descent is implemented in low precision, stochastic rounding helps prevent the stagnation of convergence caused by the vanishing gradient effect. Unbiased stochastic rounding achieves zero bias by preserving small updates with probabilities proportional to their relative magnitudes. This study provides a theoretical explanation for the stagnation of gradient descent in low-precision computation. Additionally, we propose two new stochastic rounding schemes that trade the zero-bias property for a larger probability of preserving small gradients. Our methods yield a constant rounding bias that, on average, lies in a descent direction. For convex problems, we prove that the proposed rounding methods typically have a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performance of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with an 8-bit floating-point format.
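The stagnation effect and the role of unbiased stochastic rounding can be illustrated with a minimal sketch. It is not the setup used in this work: it assumes a uniform grid with spacing `delta` as a stand-in for a low-precision number format (rather than an 8-bit floating-point format), a one-dimensional quadratic objective, and hypothetical helper names (`round_nearest`, `round_stochastic`, `gradient_descent`). The two proposed biased schemes are not shown, since their definitions are not given here.

```python
import numpy as np

def round_nearest(x, delta):
    """Deterministically round x to the nearest multiple of delta."""
    return delta * np.round(x / delta)

def round_stochastic(x, delta, rng):
    """Unbiased stochastic rounding to multiples of delta (illustrative).

    x is rounded up with probability equal to its fractional position
    between the two neighbouring grid points, so the expected value of
    the result equals x.
    """
    scaled = np.asarray(x, dtype=np.float64) / delta
    lower = np.floor(scaled)
    prob_up = scaled - lower
    round_up = rng.random(size=scaled.shape) < prob_up
    return delta * (lower + round_up)

def gradient_descent(x0, lr, steps, delta, rounder):
    """Minimise f(x) = 0.5 * x**2, rounding the iterate after every update."""
    x = x0
    for _ in range(steps):
        grad = x                      # gradient of 0.5 * x**2
        x = float(rounder(x - lr * grad, delta))
    return x

rng = np.random.default_rng(0)
delta = 2.0 ** -4                     # assumed grid spacing
x0, lr, steps = 1.0, 0.01, 500

# The update lr * grad = 0.01 is smaller than delta / 2, so round-to-nearest
# discards it at every step and the iterate never leaves x0.
x_rn = gradient_descent(x0, lr, steps, delta, round_nearest)

# Stochastic rounding keeps each small update with some probability,
# so the iterate can still move towards the minimiser at 0.
x_sr = gradient_descent(
    x0, lr, steps, delta, lambda v, d: round_stochastic(v, d, rng)
)

print("round to nearest:   ", x_rn)   # stays at 1.0
print("stochastic rounding:", x_sr)
```

In this toy setting the update `0.01` is lost entirely under round-to-nearest, whereas stochastic rounding preserves it in expectation, which is the mechanism the abstract refers to.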