Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are a standard tool for deploying large language models, but they often degrade accuracy at low bit widths (e.g., 4 bits). Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across the group's modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.
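The group-shared correction described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that the shared right factor is obtained by a truncated SVD of the stacked quantization errors of all modules that consume the same input (for example, the query, key, and value projections of one attention block), and it uses a simple round-to-nearest quantizer as a stand-in for AWQ or GPTQ. The names `fake_quantize`, `fit_group`, and `group_forward` are hypothetical, and the `restore` flag only mimics the idea of selective restoration; the criterion GlowQ-S uses to pick groups is not specified here.

```python
import numpy as np


def fake_quantize(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (a stand-in quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def fit_group(weights, rank, bits=4):
    """Fit one shared right factor for modules that share the same input.

    weights: list of (d_out_i, d_in) matrices with identical d_in.
    Returns the quantized weights, one left factor per module, and the
    shared right factor of shape (rank, d_in).
    Assumption: the factors come from a truncated SVD of the stacked errors.
    """
    quantized = [fake_quantize(w, bits) for w in weights]
    errors = np.concatenate([w - q for w, q in zip(weights, quantized)], axis=0)
    u, s, vt = np.linalg.svd(errors, full_matrices=False)
    shared_right = vt[:rank]               # (rank, d_in), shared by the group
    left_all = u[:, :rank] * s[:rank]      # (sum of d_out_i, rank)
    splits = np.cumsum([w.shape[0] for w in weights])[:-1]
    lefts = np.split(left_all, splits, axis=0)  # one left factor per module
    return quantized, lefts, shared_right


def group_forward(x, quantized, lefts, shared_right, restore=True):
    """Apply every module in the group to the same input x of shape (n, d_in)."""
    if not restore:
        # Selective variant: this group is left uncorrected.
        return [x @ q.T for q in quantized]
    cached = x @ shared_right.T            # high-precision projection, computed once
    return [x @ q.T + cached @ a.T for q, a in zip(quantized, lefts)]


# Usage: three projections (e.g. Q, K, V) reading the same 64-dimensional input.
rng = np.random.default_rng(0)
group = [rng.standard_normal((64, 64)),
         rng.standard_normal((32, 64)),
         rng.standard_normal((32, 64))]
x = rng.standard_normal((8, 64))
quantized, lefts, shared_right = fit_group(group, rank=8)
outputs = group_forward(x, quantized, lefts, shared_right)
print([o.shape for o in outputs], shared_right.shape)
```

In this sketch the only per-group high-precision matrix multiply on the input is `x @ shared_right.T`; each module then adds a cheap `(n, rank) @ (rank, d_out_i)` product. A per-layer scheme would instead store and apply a separate right factor for every module.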