Compressing large language models (LLMs) for deployment on commodity GPUs remains challenging: conventional scalar quantization is limited to fixed bit-widths (e.g., 8/4/3-bit), offers only a few discrete compression points, and typically requires calibration data. We present FASQ (Flexible Accelerated Subspace Quantization), a calibration-free framework that applies product quantization to LLM weight matrices. By tuning two parameters, sub-vector size and codebook cardinality, FASQ exposes a continuous design space spanning 27-49% of the original FP16 model size, filling compression gaps that fixed-bit schemes cannot reach. On Meta-Llama-3-8B, FASQ surpasses 4-bit GPTQ and AWQ in accuracy (67.1-67.7 avg.) at 37-42% of the FP16 model size, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. To make product quantization practical at inference time, we design custom CUDA kernels: a LUT-free direct-compute GEMV for decode and an output-stationary, double-buffered LUT GEMM for prefill, both with split-K parallelism. On an RTX 3090, FASQ achieves 45.2 tok/s decode at effective 4-bit (2.56x memory reduction) and 51.8 tok/s at effective 3-bit (2.80x memory reduction). Both exceed FP16 tensor-core throughput (43.9 tok/s) and deliver 1.6 to 1.8x the throughput of AWQ, 2.5x that of GPTQ, and 4.3 to 5x that of RTN. Among the methods compared, FASQ is the only compression method that accelerates decode beyond FP16, offering calibration-free compression, continuous size-quality trade-offs, and real-time inference on a single consumer GPU.
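To illustrate how the two parameters named in the abstract, sub-vector size and codebook cardinality, can set the compressed size, the following is a minimal sketch of the storage arithmetic for generic product quantization. It is not FASQ's actual layout: the abstract does not state how many codebooks are used, how indices are packed, or how centroids are stored, so the single shared codebook per matrix, the FP16 centroids, and the function names `pq_size_ratio` and `effective_bits` are all assumptions made for illustration.

```python
def pq_size_ratio(rows, cols, subvec_size, codebook_size,
                  centroid_bits=16, fp_bits=16):
    """Compressed size divided by FP16 size for one weight matrix.

    Assumed layout (hypothetical, not taken from the paper):
      * the matrix is cut into sub-vectors of `subvec_size` weights,
      * each sub-vector is stored as one index into a single codebook,
      * the codebook holds `codebook_size` centroids in FP16.
    """
    n_weights = rows * cols
    if n_weights % subvec_size != 0:
        raise ValueError("matrix size must be divisible by subvec_size")

    n_subvecs = n_weights // subvec_size
    # Bits needed to address one of `codebook_size` centroids.
    index_bits = (codebook_size - 1).bit_length()

    index_storage = n_subvecs * index_bits
    codebook_storage = codebook_size * subvec_size * centroid_bits
    return (index_storage + codebook_storage) / (n_weights * fp_bits)


def effective_bits(rows, cols, subvec_size, codebook_size):
    """Average number of stored bits per original weight."""
    return 16 * pq_size_ratio(rows, cols, subvec_size, codebook_size)


if __name__ == "__main__":
    # Sweep a few (sub-vector size, codebook size) pairs on a 4096 x 4096 matrix.
    for d, k in [(2, 256), (4, 4096), (3, 512), (8, 65536)]:
        ratio = pq_size_ratio(4096, 4096 * 3, d, k)
        bits = effective_bits(4096, 4096 * 3, d, k)
        print(f"d={d} K={k}: {ratio:.3f} of FP16, {bits:.2f} bits/weight")
```

Under these assumptions the per-weight cost is roughly `log2(K) / d` bits plus a small codebook overhead, so varying `d` and `K` together produces fractional bit-widths that a fixed 3-, 4-, or 8-bit scalar scheme cannot express. The numbers this sketch prints are not expected to match the 27-49% range reported in the abstract, which depends on details the abstract does not give.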