Low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, label, parameter, activation, and gradient. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how each quantization target affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum; and label quantization introduces additional approximation error. Crucially, we distinguish the effects of two quantization schemes: we prove that for additive quantization (with constant quantization steps), the noise amplification benefits from a suppression effect scaled by the batch size, while multiplicative quantization (with input-dependent quantization steps) largely preserves the spectral structure, thereby reducing spectral distortion. Furthermore, under common polynomial-decay data spectra, we quantitatively compare the risks of multiplicative and additive quantization, drawing a parallel to the comparison between floating-point (FP) and integer quantization methods. Our theory provides a powerful lens for characterizing how quantization shapes the learning dynamics of optimization algorithms, paving the way for further exploration of learning theory under practical hardware constraints.
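To make the contrast between the two schemes concrete, the following is a minimal runnable sketch, not the paper's construction: finite-step SGD for linear regression with a quantized stochastic gradient, comparing an additive quantizer with a constant step against a multiplicative, FP-style quantizer whose step scales with the input's magnitude. The function names (quantize_additive, quantize_multiplicative, sgd_excess_risk), the diagonal polynomial-decay covariance, and all hyperparameters (step delta=0.05, 4 mantissa bits, learning rate, batch size) are illustrative assumptions, not definitions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_additive(x, delta=0.05):
    """Additive quantization (integer-style): a fixed grid with constant
    step `delta`, so the quantization error is at most delta/2 regardless
    of the magnitude of x."""
    return delta * np.round(x / delta)

def quantize_multiplicative(x, bits=4):
    """Multiplicative quantization (FP-style): the step scales with |x|.
    Here the mantissa is rounded to `bits` bits, so each entry is perturbed
    by a bounded *relative* factor."""
    mant, expo = np.frexp(x)                # x = mant * 2**expo, |mant| in [0.5, 1)
    scale = 2.0 ** bits
    return np.ldexp(np.round(mant * scale) / scale, expo)

# Illustrative setup: d-dimensional linear regression with a
# polynomial-decay data spectrum, y = <w_star, x> + noise.
d, n, batch, steps, lr = 64, 4096, 16, 200, 0.1
spectrum = (1.0 + np.arange(d)) ** -2.0     # polynomial-decay eigenvalues
X = rng.standard_normal((n, d)) * np.sqrt(spectrum)
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

def sgd_excess_risk(quantize):
    """Finite-step mini-batch SGD in which the stochastic gradient is
    quantized before the parameter update."""
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * quantize(grad)            # quantized gradient step
    err = w - w_star
    return float(err @ (spectrum * err))    # excess risk under the data covariance

print("additive      :", sgd_excess_risk(quantize_additive))
print("multiplicative:", sgd_excess_risk(quantize_multiplicative))
```

In this toy setup, the additive quantizer injects noise of a fixed scale (and zeros out gradient entries smaller than delta/2), whereas the multiplicative quantizer perturbs each coordinate by a bounded relative factor, which illustrates the mechanism by which input-dependent quantization steps can roughly preserve the spectral structure discussed above.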