Low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, labels, parameters, activations, and gradients. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how each quantization target affects learning: parameter, activation, and gradient quantization amplify noise during training, whereas data quantization distorts the data spectrum and introduces additional approximation error. Crucially, we distinguish the effects of two quantization schemes: we prove that under additive quantization (with a constant quantization step), the amplified noise benefits from a suppression effect that scales with the batch size, while multiplicative quantization (with an input-dependent quantization step) largely preserves the spectral structure, thereby reducing the spectral distortion. Furthermore, under common polynomial-decay data spectra, we quantitatively compare the risks of multiplicative and additive quantization, drawing a parallel to the comparison between floating-point (FP) and integer quantization methods. Our theory provides a powerful lens for characterizing how quantization shapes the learning dynamics of optimization algorithms, paving the way for further exploration of learning theory under practical hardware constraints.
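To make the distinction between the two schemes concrete, the following minimal NumPy sketch contrasts an additive quantizer, whose step `delta` is constant, with a multiplicative quantizer, whose step scales with the magnitude of its input, and applies each to the mini-batch gradient in a single SGD step for linear regression. The quantizer definitions, the bit-width `bits`, and all numerical choices here are illustrative assumptions for exposition, not the paper's exact constructions.

```python
import numpy as np

def quantize_additive(x, delta=0.05):
    """Additive (integer-style) quantization: constant step `delta`,
    so the rounding error is at most delta/2 regardless of |x|.
    (Illustrative choice of quantizer, not the paper's definition.)"""
    return delta * np.round(x / delta)

def quantize_multiplicative(x, bits=4):
    """Multiplicative (FP-style) quantization: the step is set from the
    input's magnitude, so the *relative* error is roughly constant.
    (One possible input-dependent-step quantizer, assumed here.)"""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
    return scale * np.round(x / scale)

# Toy setting: one quantized-gradient SGD step for linear regression.
rng = np.random.default_rng(0)
d, batch = 8, 32
w_star = rng.normal(size=d)                      # ground-truth parameter
X = rng.normal(size=(batch, d))                  # mini-batch of data
y = X @ w_star + 0.1 * rng.normal(size=batch)    # noisy labels

w = np.zeros(d)
lr = 0.1
grad = X.T @ (X @ w - y) / batch                 # exact mini-batch gradient
for name, q in [("additive", quantize_additive),
                ("multiplicative", quantize_multiplicative)]:
    w_next = w - lr * q(grad)                    # SGD step with quantized gradient
    print(name, np.linalg.norm(w_next - w_star))
```

Under the additive rule the per-coordinate rounding error is bounded by `delta/2` independently of the gradient's scale, whereas under the multiplicative rule the error grows roughly in proportion to the gradient's magnitude, mirroring the integer-versus-FP contrast drawn above.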