ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for LLMs that integrates TurboQuant (TQ), a rotation-domain strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit methods lose precision because weight distributions are heavy-tailed and contain inter-channel outliers. ITQ3_S pre-rotates the weight space via the FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. We derive a rigorous dequantization procedure that fuses a 256-point inverse FWHT into the CUDA shared-memory loading stage, so that the reconstruction error is bounded exclusively by the ternary quantization grid, with no additional error from inverting the transform. For any weight vector $\mathbf{w} \in \mathbb{R}^{256}$, the reconstruction $\hat{\mathbf{w}}$ satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$, a bound strictly smaller than that of uniform 3-bit baselines that do not exploit rotation-induced distribution normalization. TurboQuant lacks a native CUDA kernel, which precludes direct deployment, and naively composing TQ with existing weight quantizers introduces domain-mismatch errors that accumulate across layers, degrading quality below standard 3-bit baselines. ITQ3_S resolves this by co-designing the FWHT rotation and the quantization kernel as a unified pipeline grounded in the IQ3_S weight format, with the inverse transform fused into the CUDA MMQ kernel. Empirically, on the NVIDIA RTX 5090 (Blackwell), ITQ3_S achieves perplexity competitive with FP16 while delivering throughput exceeding $1.5\times$ that of 4-bit alternatives via optimized DP4A and Tensor Core scheduling. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer hardware.
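
To make the rotation argument concrete, the following NumPy sketch illustrates two properties the abstract relies on: the orthonormal FWHT is its own inverse, and because it preserves the Euclidean norm, the error measured after the inverse transform equals the quantization error in the rotated domain. This is an illustration only, not the fused CUDA kernel or the IQ3_S bit layout. The names `fwht` and `ternary_quantize`, the single mean-absolute-value scale, and the synthetic weight vector are all assumptions introduced here.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length vector."""
    y = np.asarray(x, dtype=np.float64).copy()
    n = y.shape[0]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly stage: combine the two halves of every block of size 2h.
        y = y.reshape(-1, 2, h)
        a = y[:, 0, :] + y[:, 1, :]
        b = y[:, 0, :] - y[:, 1, :]
        y = np.stack([a, b], axis=1).reshape(n)
        h *= 2
    # Scaling by 1/sqrt(n) makes the transform orthonormal and self-inverse.
    return y / np.sqrt(n)


def ternary_quantize(v):
    """Hypothetical ternary coder: codes in {-1, 0, +1} times one shared scale."""
    scale = np.mean(np.abs(v))  # assumed scale choice, for illustration only
    codes = np.clip(np.round(v / scale), -1, 1)
    return codes * scale


# Synthetic 256-element weight block with one large outlier (made-up data).
rng = np.random.default_rng(0)
w = rng.standard_normal(256)
w[7] = 50.0

v = fwht(w)                  # rotate: outlier energy is spread over all entries
v_hat = ternary_quantize(v)  # quantize in the rotated domain
w_hat = fwht(v_hat)          # inverse rotation (same transform)

err_v = np.linalg.norm(v_hat - v)  # quantization error in the rotated domain
err_w = np.linalg.norm(w_hat - w)  # reconstruction error in the weight domain
print("max |w| =", np.max(np.abs(w)), " max |v| =", np.max(np.abs(v)))
print("rotated-domain error =", err_v, " weight-domain error =", err_w)
```

The two printed errors agree up to floating-point rounding, which is the sense in which inverting the transform adds no error beyond the quantization grid. The sketch does not reproduce the paper's bound $\epsilon_q$ or any comparison against 3-bit baselines.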
