We present a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs): Entropy-Weighted Quantization (EWQ). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures, from 1.6B to 70B parameters, showing consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding is that EWQ can reduce perplexity below that of the unquantized model, suggesting a beneficial regularization effect from selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need to load model weights. By exploiting universal characteristics of the entropy distribution that persist across architectures and scales, FastEWQ enables near-instantaneous quantization decisions while maintaining 80% classification accuracy relative to full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
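To make the core selection step concrete, the following is a minimal sketch, assuming entropy is computed over each block's empirical weight histogram and that low-entropy blocks are the safer quantization candidates; the scoring rule, histogram range, and `threshold` value are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of entropy-based block selection (illustrative assumptions:
# entropy over a fixed-range weight histogram; low-entropy blocks quantized).
import numpy as np

def block_entropy(weights, num_bins=256):
    """Shannon entropy (bits) of a block's empirical weight distribution."""
    hist, _ = np.histogram(weights, bins=num_bins, range=(-1.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())

def select_blocks_to_quantize(blocks, threshold=6.0):
    """Flag blocks whose weight entropy falls below `threshold` (a
    hypothetical cutoff) for low-precision quantization; the rest stay
    at full precision."""
    return [name for name, w in blocks.items()
            if block_entropy(w) < threshold]

# Toy usage with two synthetic "transformer blocks":
rng = np.random.default_rng(0)
blocks = {
    "block_0": rng.normal(0.0, 0.02, 10_000),  # tightly peaked -> low entropy
    "block_1": rng.normal(0.0, 0.30, 10_000),  # widely spread  -> high entropy
}
print(select_blocks_to_quantize(blocks))       # -> ['block_0']
```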
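FastEWQ's weight-free decision can be pictured as a lightweight classifier over block metadata. The sketch below is an assumption-laden illustration: the feature set (normalized block depth, log model size), the synthetic labels, and the logistic-regression choice are hypothetical stand-ins for whatever features and model the paper actually uses.

```python
# Hypothetical sketch of the FastEWQ idea: predict the quantize/keep decision
# for a block from metadata alone, with no weights loaded. Features, labels,
# and classifier are illustrative assumptions, not the paper's method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training set: [normalized block depth, log10(model parameters)].
# Labels stand in for decisions produced by full entropy analysis
# (1 = safe to quantize).
X = rng.uniform(size=(500, 2))
X[:, 1] = 9.0 + 2.0 * X[:, 1]        # log10(params) roughly in [9, 11]
y = (X[:, 0] > 0.3).astype(int)      # toy rule: later blocks quantize safely

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Near-instant decision for an unseen block, no weight tensors touched:
query = np.array([[0.8, 10.5]])      # block at 80% depth of a ~30B model
print(clf.predict(query))            # -> [1]: quantize
```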