Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still accumulate updates in high precision: gradient updates must be applied to a high-precision weight buffer, known as the $\textit{master weights}$. This buffer introduces substantial memory overhead, particularly for Sparse Mixture-of-Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes the weights after each step and injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop that requires no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, whereas naively removing the master weights can incur an error inversely proportional to the learning rate. Empirically, we evaluate ECO on pretraining small Transformers (30M-800M parameters), a Gemma-3 1B model, and a 2.1B-parameter Sparse MoE model with FP8 quantization, as well as on fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches master-weight baselines to near-lossless accuracy, significantly improving the static-memory vs. validation-loss Pareto frontier.
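To make the error-feedback loop concrete, the following is a minimal sketch of the idea the abstract describes, assuming SGD with momentum and a uniform quantizer as a stand-in for FP8/INT4. The function names, the `scale` parameter, and the sign convention for the momentum injection are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def quantize(w, scale=0.1):
    # Illustrative uniform quantizer (stand-in for FP8/INT4 formats).
    return np.round(w / scale) * scale

def eco_step(w_q, grad, momentum, lr=0.01, beta=0.9):
    """One ECO-style update with no master-weight buffer (sketch).

    w_q      : current quantized weights (the only stored copy)
    grad     : gradient at w_q
    momentum : optimizer momentum (reused to carry the error feedback)
    """
    # Standard momentum accumulation.
    momentum = beta * momentum + grad
    # Apply the update directly to the quantized weights, then re-quantize.
    w_new = w_q - lr * momentum
    w_q_new = quantize(w_new)
    # Error feedback: fold the quantization error into the momentum so the
    # lost fraction of the update is re-applied on subsequent steps.
    # (Sign/scale chosen so that -lr * momentum recovers the error later.)
    err = w_new - w_q_new
    momentum = momentum - err / lr
    return w_q_new, momentum
```

Because the error is absorbed into the existing momentum buffer rather than a separate residual, the memory footprint is unchanged relative to a plain quantized optimizer, which is the point of the design.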