4-bit quantization reduces the memory footprint and latency of large language model inference, but its aggressive precision reduction can severely degrade accuracy. Prior methods address this by decomposing each weight matrix into two components (e.g., via singular value decomposition) and quantizing them separately, assigning the bulk of values to a low-precision residual component while handling outliers with a high-precision low-rank component. However, such decompositions are designed to minimize the real-valued energy of the residual, rather than the post-quantization error of the residual and low-rank components. We propose TwinQuant, a 4-bit quantization framework that learns quantization-friendly decomposed subspaces and jointly reshapes both the low-rank and residual components. TwinQuant learns component-specific transformations via a joint optimization over the Stiefel and general linear manifolds, flattening their distributions and reducing dynamic-range imbalance. To enable efficient end-to-end execution, we further design a fused dual-component kernel that pipelines the two-stage low-rank computation on-chip and merges both components with a single epilogue, avoiding intermediate global-memory traffic. Across LLaMA3 and Qwen3 models, TwinQuant preserves near-FP16 accuracy and delivers up to $1.8\times$ end-to-end speedup over an FP16 baseline.
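To make the baseline concrete, the following is a minimal sketch of the prior decomposition approach that the abstract contrasts against: an SVD split into a high-precision low-rank component plus a 4-bit residual. It is not TwinQuant's learned transformation or its fused kernel. The helper names `quantize_int4` and `svd_split`, the per-row symmetric quantizer, the rank of 4, and the synthetic weight matrix are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row 4-bit quantization (assumed scheme); returns dequantized values."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero rows
    q = np.clip(np.round(x / scale), -8, 7)       # signed 4-bit integer grid
    return q * scale

def svd_split(w, rank):
    """Split w into a rank-`rank` component and the remaining residual via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return low_rank, w - low_rank

# Synthetic weights (assumption): small Gaussian noise plus one dominant
# rank-1 direction that stretches the dynamic range of every row.
rng = np.random.default_rng(0)
w = 0.02 * rng.standard_normal((64, 64))
w += 0.5 * np.outer(rng.standard_normal(64), rng.standard_normal(64))

low_rank, residual = svd_split(w, rank=4)

w_direct = quantize_int4(w)                       # quantize the whole matrix
w_split = low_rank + quantize_int4(residual)      # keep low-rank part in full precision

err_direct = np.linalg.norm(w - w_direct)
err_split = np.linalg.norm(w - w_split)
print(f"direct 4-bit error: {err_direct:.4f}")
print(f"SVD split error:    {err_split:.4f}")
```

The truncated SVD minimizes the Frobenius norm of the real-valued residual, which is the "real-valued energy" objective mentioned above. Nothing in this procedure accounts for how the residual behaves after rounding to the 4-bit grid, which is the gap the abstract says TwinQuant targets.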