Although recent quantized Large Language Models (LLMs), such as BitNet, have paved the way for a significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands a substantial memory footprint. This is partly because the high-precision (i.e., unquantized) weight matrices required for straight-through estimation must be maintained throughout training. To address this, we explore the potential of directly updating the quantized low-precision weight matrices without relying on the straight-through estimator during backpropagation, thereby saving memory during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values, (2) extending the bit width to 8 bits results in only a 5% loss degradation compared to BitNet b1.58 while offering the potential for reduced memory usage during training, and (3) our models can also perform inference using ternary weights, showcasing their flexibility in deployment.
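As a minimal sketch of the core idea, the snippet below illustrates unbiased stochastic rounding and how a gradient step could be applied directly to ternary weights. This is not the paper's implementation; the function names, the SGD-style update, and the clipping to {-1, 0, +1} are illustrative assumptions.

```python
import numpy as np

def stochastic_round(x, rng):
    # Round down, then round up with probability equal to the fractional
    # part, so the rounding is unbiased: E[stochastic_round(x)] == x.
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

def ternary_update(w_q, grad, lr, rng):
    # Hypothetical update rule: take an SGD step on the ternary weights,
    # stochastically round back to integers, and clip to {-1, 0, +1},
    # never materializing a persistent high-precision weight matrix.
    return np.clip(stochastic_round(w_q - lr * grad, rng), -1, 1)
```

Because the rounding is unbiased, small gradient contributions are not systematically lost: a weight nudged by 0.3 flips to the next ternary level with probability 0.3, so the update is preserved in expectation even though each stored weight stays low-bit.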