HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression

Post-training quantization is essential for deploying Large Language Models (LLMs) on resource-constrained devices. However, standard integer quantization (e.g., INT4) fundamentally degrades performance by imposing a uniform grid on the heavy-tailed distribution of weight parameters, particularly in smaller-scale models (e.g., <2B parameters). We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution using second-order sensitivity analysis. HAS-VQ employs a Hessian-Masked Decoupling strategy to isolate sensitive parameters, followed by robust Vector Quantization (VQ) of the remaining dense body. Crucially, we introduce a residual sparse feedback mechanism that corrects quantization errors in the most sensitive dimensions, ensuring exact reconstruction of outliers. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority: (1) Pareto Dominance over Integer Baselines: At 4.23 effective bits-per-parameter (BPP), we achieve a perplexity of 14.23, significantly outperforming the standard INT4 baseline (20.03 PPL at 4.71 BPP). (2) High-Fidelity Compression: Relative to the FP16 baseline, HAS-VQ achieves a 2.3x reduction in model size (7.03 BPP) while maintaining statistically indistinguishable perplexity (10.12 vs. 10.04), effectively offering a lossless compression alternative for bandwidth-constrained environments. The code is available at https://github.com/VladimerKhasia/HASVQ
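The three stages named in the abstract (Hessian-masked decoupling, vector quantization of the dense body, and residual sparse feedback) can be illustrated with a toy NumPy sketch. This is not the authors' implementation; see the linked repository for that. Everything specific here is an assumption made for illustration: the function name `has_vq_sketch`, the saliency proxy `h_diag * W**2` built from a diagonal Hessian estimate, plain k-means as the VQ step, and the values of `outlier_frac`, `dim`, and `codebook_size`.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Index of the closest codebook entry (squared L2) for each vector."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def has_vq_sketch(W, h_diag, outlier_frac=0.01, dim=4,
                  codebook_size=16, iters=10, seed=0):
    """Toy decouple -> VQ -> sparse-correction pass over one weight matrix.

    All hyperparameters and the saliency proxy are illustrative assumptions.
    Requires W.size to be divisible by `dim`.
    """
    rng = np.random.default_rng(seed)

    # 1) Decoupling: flag the top-k weights by an assumed sensitivity
    #    proxy, diagonal-Hessian entry times squared weight.
    saliency = h_diag * W ** 2
    k = max(1, int(outlier_frac * W.size))
    top = np.argpartition(saliency.ravel(), -k)[-k:]
    mask = np.zeros(W.size, dtype=bool)
    mask[top] = True
    mask = mask.reshape(W.shape)

    # 2) VQ of the dense body: zero out flagged weights, split the rest
    #    into length-`dim` vectors, and fit a codebook with plain k-means.
    body = np.where(mask, 0.0, W)
    vectors = body.reshape(-1, dim)
    init = rng.choice(len(vectors), size=codebook_size, replace=False)
    codebook = vectors[init].copy()
    for _ in range(iters):
        assign = nearest_code(vectors, codebook)
        for c in range(codebook_size):
            members = vectors[assign == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    assign = nearest_code(vectors, codebook)
    W_vq = codebook[assign].reshape(W.shape)

    # 3) Sparse residual: keep the quantization error only at flagged
    #    positions, so those weights are restored up to float rounding.
    residual = np.where(mask, W - W_vq, 0.0)
    W_hat = W_vq + residual
    return W_hat, W_vq, mask


# Synthetic heavy-tailed weights and a made-up positive Hessian diagonal.
rng = np.random.default_rng(0)
W = rng.standard_t(df=3, size=(64, 64))
h_diag = rng.random((64, 64)) + 0.1
W_hat, W_vq, mask = has_vq_sketch(W, h_diag)
print("outliers kept:", int(mask.sum()))
print("MSE, VQ only:", float(np.mean((W - W_vq) ** 2)))
print("MSE, with sparse correction:", float(np.mean((W - W_hat) ** 2)))
```

In this sketch the sparse residual costs extra storage for each flagged weight (a value plus its index), which is presumably why an effective bits-per-parameter figure such as those quoted above has to account for more than the codebook indices alone.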
