Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are a standard method for deploying large language models, but they often degrade accuracy at low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across the group's modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average. Code is available at https://github.com/ahnselim/GlowQ.
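To make the group-sharing idea concrete, the following is a minimal NumPy sketch and not the released GlowQ implementation. It assumes that the modules of an input-sharing group (for example, projections that consume the same hidden state) have their quantization errors stacked and factorized by a truncated SVD, which yields one shared right factor and one left factor per module. The abstract does not specify the factorization or the quantizer, so the truncated SVD, the round-to-nearest quantizer, and all function names here (`fake_quantize`, `fit_group_shared_correction`, `group_forward`) are illustrative assumptions.

```python
import numpy as np


def fake_quantize(w, bits=4):
    """Symmetric per-row round-to-nearest quantizer.

    Assumption: a simple stand-in for AWQ/GPTQ/BitsAndBytes, used only
    to produce a quantization error to correct.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def fit_group_shared_correction(weights, rank, bits=4):
    """Fit one shared right factor for modules that share the same input.

    weights: list of (d_out_i, d_in) matrices with identical d_in.
    Returns the quantized weights, per-module left factors (d_out_i, rank),
    and the shared right factor (rank, d_in).
    """
    quantized = [fake_quantize(w, bits) for w in weights]
    # Stack the per-module errors along the output dimension.
    stacked_error = np.vstack([w - q for w, q in zip(weights, quantized)])
    # Assumption: plain truncated SVD of the stacked error.
    u, s, vt = np.linalg.svd(stacked_error, full_matrices=False)
    shared_right = vt[:rank]            # (rank, d_in), shared by the group
    left_all = u[:, :rank] * s[:rank]   # (sum of d_out_i, rank)
    boundaries = np.cumsum([w.shape[0] for w in weights])[:-1].tolist()
    left_factors = np.split(left_all, boundaries, axis=0)
    return quantized, left_factors, shared_right


def group_forward(x, quantized, left_factors, shared_right, restore=True):
    """Apply every module of the group to x of shape (batch, d_in).

    With restore=True the projection onto the shared right factor is
    computed once and reused by all modules. With restore=False the group
    is left uncorrected, mimicking a group skipped by a selective variant.
    """
    if not restore:
        return [x @ q.T for q in quantized]
    z = x @ shared_right.T  # (batch, rank), computed once per group
    return [x @ q.T + z @ a.T for q, a in zip(quantized, left_factors)]


# Hypothetical group of three projections reading the same 64-dim input.
rng = np.random.default_rng(0)
group = [rng.standard_normal((d_out, 64)) for d_out in (64, 32, 32)]
x = rng.standard_normal((8, 64))
q_w, lefts, right = fit_group_shared_correction(group, rank=8)
outputs = group_forward(x, q_w, lefts, right)
print([o.shape for o in outputs])
```

In this sketch the group stores a single `(rank, d_in)` right factor instead of one per module, and the `x @ shared_right.T` product is evaluated once per group rather than once per module. The `restore` flag only illustrates skipping a group; how GlowQ-S chooses which groups or layers to restore is not described in the abstract and is not modeled here.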