4-bit quantization significantly reduces the memory footprint of large language models (LLMs) and accelerates their inference. However, its limited bit-width struggles to faithfully represent both dense common values (\emph{inliers}) and rare large-magnitude values (\emph{outliers}), causing substantial accuracy degradation. Existing mixed-precision methods mitigate this by retaining outliers in high precision, but they break the uniformity of low-bit execution and introduce precision conversion and extra data movement that undermine practical speedup. We propose \textbf{MosaicQuant}, a unified 4-bit LLM quantization paradigm built on a novel principle of \emph{inlier--outlier disaggregation}. Rather than elevating outlier precision, MosaicQuant quantizes the full weight matrix into a dense 4-bit base component, in which inliers are captured faithfully while outliers are inevitably quantized as well and incur quantization errors. A sparse 4-bit residual component is then introduced to compensate for these errors, selectively targeting the most error-critical weight blocks, where output distortion is shown to be concentrated. A unified representation alone is insufficient, however, because naïvely executing the sparse residual as a separate kernel still breaks the unified low-bit inference pipeline. To bridge this gap, we introduce \textbf{ZipperEngine}, which fuses sparse block computation into the dense 4-bit GEMM kernel via an overlapped pipeline, unifying not only the representation but also the execution into a single coherent low-bit inference pipeline. Extensive experiments on LLaMA3 and Qwen3 demonstrate that MosaicQuant preserves near-FP16 accuracy while achieving up to $1.24\times$ speedup over the W16A16 baseline.
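To make the idea of inlier--outlier disaggregation concrete, the following is a minimal sketch of the dense-base-plus-sparse-residual representation, not the MosaicQuant implementation. It rests on several assumptions that the abstract does not specify: symmetric per-block 4-bit quantization with one scale per block, weight-space squared error as a stand-in for the output-distortion criterion, and a fixed fraction of blocks receiving a residual. The names \texttt{quantize\_int4}, \texttt{dequantize}, \texttt{disaggregate}, and \texttt{keep\_ratio} are hypothetical, and the matrix tiling and the fused ZipperEngine kernel are out of scope.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization of one block (assumed scheme):
    integer codes in [-8, 7] plus a single floating-point scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit codes back to floating-point values."""
    return [c * scale for c in codes]


def squared_error(a, b):
    """Sum of squared differences between two equally sized blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def disaggregate(blocks, keep_ratio=0.25):
    """Quantize every block to a dense 4-bit base, then add a 4-bit
    residual only for the blocks with the largest base error.

    blocks: list of weight blocks, each a list of floats.
    keep_ratio: hypothetical fraction of blocks that get a residual.
    Returns (base reconstruction, base + residual reconstruction,
    sorted indices of the blocks that received a residual).
    """
    # Dense 4-bit base: every block, outliers included, is quantized.
    base = [dequantize(*quantize_int4(blk)) for blk in blocks]

    # Per-block error of the base. Weight-space error is used here as a
    # proxy; the abstract selects blocks by output distortion instead.
    errors = [squared_error(blk, rec) for blk, rec in zip(blocks, base)]

    # Pick the most error-critical blocks.
    k = max(1, round(keep_ratio * len(blocks)))
    ranked = sorted(range(len(blocks)), key=lambda i: errors[i], reverse=True)
    selected = ranked[:k]

    # Sparse 4-bit residual: quantize (weight - base) for selected blocks.
    combined = [list(rec) for rec in base]
    for i in selected:
        residual = [w - b for w, b in zip(blocks[i], base[i])]
        res_hat = dequantize(*quantize_int4(residual))
        combined[i] = [b + r for b, r in zip(base[i], res_hat)]

    return base, combined, sorted(selected)
```

In this sketch, a block containing a single large outlier gets a coarse scale in the base pass, so its inliers collapse toward zero. The residual pass then re-quantizes what is left with a much finer scale, and both components remain 4-bit.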