SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

Efficient large language model (LLM) serving is increasingly constrained by deployment cost. Quantization is a key technique for reducing serving cost, yet even state-of-the-art 4-bit quantizers exhibit a noticeable quality gap from FP16, particularly for smaller models where low-bit serving is most beneficial. We identify a fundamental cause of this gap: quantization error is highly input-dependent and varies substantially across tokens, while existing post-quantization compensation methods are static and apply identical corrections to all inputs. As a result, easy tokens are over-corrected while hard tokens remain under-corrected. We present SPEAR, a system for post-quantization error-adaptive recovery that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers identified through a CKA-guided entropy-aware diagnostic. This focuses a small parameter budget where it is most effective. Efficient deployment of ECs presents several systems challenges, including additional computation, tensor-parallel synchronization caused by input-dependent gating, and latency instability across configurations. SPEAR addresses these issues through adaptive kernel-fusion dispatch, combining an epilogue-integrated peer-reduction kernel with P2P dual-write to fuse the post-EC computation into low-bit GEMMs, and an SLO-constrained EC-aware scheduler for predictable serving performance. Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and maintaining latency comparable to a widely used 4-bit serving deployment.
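
The per-token gating idea can be illustrated with a minimal sketch. The abstract does not specify the internal form of an Error Compensator, so everything here is an assumption for illustration: the EC is modeled as a low-rank correction (`down`, then `up`), the gate is a scalar sigmoid of a linear projection of the token, and the names `gated_ec_linear`, `gate_w`, and `gate_b` are hypothetical rather than taken from SPEAR.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_ec_linear(x, w_q, down, up, gate_w, gate_b):
    """Quantized linear layer plus a gated correction for one token x.

    Assumed form (not confirmed by the abstract):
        y = W_q x + g(x) * Up(Down x),   g(x) = sigmoid(gate_w . x + gate_b)
    A static compensator corresponds to fixing g(x) = 1 for every token.
    """
    base = matvec(w_q, x)                     # output of the low-bit layer
    correction = matvec(up, matvec(down, x))  # low-rank error estimate
    gate = sigmoid(sum(a * b for a, b in zip(gate_w, x)) + gate_b)
    return [b + gate * c for b, c in zip(base, correction)], gate


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5]                      # one token's hidden state
    w_q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # stand-in for dequantized weights
    down = [[1.0, 1.0, 1.0]]                  # rank-1 compensator, hypothetical values
    up = [[2.0], [-1.0]]
    y, gate = gated_ec_linear(x, w_q, down, up, [0.3, 0.0, 0.0], 0.0)
    print(y, gate)
```

Under these assumptions, a token whose gate is near 0 passes through the quantized layer almost unchanged, while a token whose gate is near 1 receives the full correction. That is the behavior the abstract contrasts with static compensation, which applies the same correction strength to every input.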

