Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this work, we study SGD convergence under a gradient-shrinkage model in which each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), slowing convergence whenever \( q_{\min} < 1 \). Under standard smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a rate governed by \( q_{\min} \) and with a higher steady-state error level due to quantization effects. This provides a theoretical account of how lower numerical precision slows training, by treating it as gradient shrinkage within the standard SGD convergence framework.
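The effective-stepsize view can be made concrete with a small numerical sketch. The snippet below is an illustration, not the paper's experimental setup: it runs SGD on a simple quadratic \( f(x) = \tfrac{1}{2}\|x\|^2 \) with an assumed noise model and an assumed uniform schedule for the shrinkage factor \( q_k \), and compares the unshrunk baseline against a run where each gradient is multiplied by \( q_k \in [q_{\min}, 1] \), so the update uses the effective stepsize \( \mu q_k \).

```python
# Minimal sketch (illustrative, not the paper's setup): SGD on a quadratic with a
# multiplicative gradient-shrinkage factor q_k in (0, 1], showing that shrinking
# the gradient acts like replacing the stepsize mu with an effective stepsize mu * q_k.
# The objective, noise model, and q_k schedule are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)
dim, steps, mu = 10, 2000, 0.05

def run_sgd(q_min):
    x = rng.normal(size=dim)                    # initial iterate
    losses = []
    for k in range(steps):
        grad = x + 0.1 * rng.normal(size=dim)   # stochastic gradient of 0.5*||x||^2
        q_k = rng.uniform(q_min, 1.0)           # shrinkage factor modeling quantization
        x = x - mu * q_k * grad                 # update with effective stepsize mu * q_k
        losses.append(0.5 * np.dot(x, x))
    return losses

full = run_sgd(q_min=1.0)    # baseline: no shrinkage (full precision)
low = run_sgd(q_min=0.25)    # low-precision proxy: gradients shrunk by up to 4x
print(f"final loss, full precision:  {full[-1]:.4e}")
print(f"final loss, shrunken grads:  {low[-1]:.4e}")  # decays more slowly
```

In this toy setting, smaller values of \( q_{\min} \) visibly slow the decay of the loss, consistent with convergence being paced by the effective stepsize \( \mu_k q_k \).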