Current mainstream post-training quantization (PTQ) methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CALM (CKA-guided Adaptive Layer-wise Modularization), a fine-tuning-free, plug-and-play framework for algorithmically heterogeneous quantization. CALM evaluates multiple PTQ algorithms independently on each layer and employs linear Centered Kernel Alignment (CKA) as a similarity metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods on mainstream LLMs, including LLaMA and Qwen, in terms of perplexity (PPL) and downstream task performance.
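As a rough illustration of the selection criterion (not the released implementation), the sketch below shows how per-layer linear CKA could be computed between full-precision and quantized layer activations, and how the best-scoring PTQ algorithm might be chosen for each layer. The container names `fp_acts` and `quant_acts_by_algo` are hypothetical placeholders for calibration-set activations.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA similarity between two activation matrices of shape (n_samples, dim)."""
    # Center each feature dimension across the calibration samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def select_per_layer(fp_acts: dict, quant_acts_by_algo: dict) -> dict:
    """For each layer, pick the PTQ algorithm whose quantized activations
    are most similar (highest linear CKA) to the full-precision activations."""
    choices = {}
    for layer, fp in fp_acts.items():
        scores = {algo: linear_cka(fp, acts[layer])
                  for algo, acts in quant_acts_by_algo.items()}
        choices[layer] = max(scores, key=scores.get)
    return choices
```

In this sketch, the per-layer choices returned by `select_per_layer` would then be assembled into the hybrid quantized model described above.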