We present Entropy-Weighted Quantization (EWQ), a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures, ranging from 1.6B to 70B parameters, and show consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding is that EWQ can reduce perplexity relative to unquantized models, suggesting a beneficial regularization effect from selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need to load model weights. FastEWQ exploits universal characteristics of entropy distributions that persist across architectures and scales, enabling near-instantaneous quantization decisions while maintaining 80% classification agreement with full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
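As a schematic illustration of the core selection rule, the sketch below ranks transformer blocks by the entropy of their weight distributions and marks the lowest-entropy fraction for quantization. The histogram-based entropy estimator, the low-entropy-first ordering, and the fixed coverage fraction are simplifying assumptions for illustration, not the full EWQ criterion.

```python
import numpy as np

def block_entropy(weights: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (in bits) of a block's weight values,
    estimated from a histogram of the flattened tensor."""
    hist, _ = np.histogram(weights.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

def select_blocks(blocks: dict, coverage: float = 0.5) -> set:
    """Mark the lowest-entropy fraction of transformer blocks as
    safe to quantize. `blocks` maps block names to weight tensors;
    the `coverage` knob is an illustrative simplification."""
    ranked = sorted(blocks, key=lambda name: block_entropy(blocks[name]))
    return set(ranked[: int(len(ranked) * coverage)])

# Example with random stand-in weights for an 8-block model.
rng = np.random.default_rng(0)
toy = {f"block_{i}": rng.normal(scale=1 + 0.2 * i, size=10_000)
       for i in range(8)}
to_quantize = select_blocks(toy)  # names of blocks to quantize
```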
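The following sketch illustrates the FastEWQ idea of replacing full entropy analysis with a lightweight classifier over weight-free metadata. The feature set (relative block depth, block and model size, all derivable from a config file) and the random-forest choice are illustrative assumptions, and the training labels here are synthetic; in practice labels would come from running full EWQ analysis once on reference models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def block_features(block_index, n_blocks, block_params, model_params):
    """Features available without loading any weights; the exact
    feature set is an illustrative guess."""
    return [block_index / n_blocks,    # relative depth in the stack
            np.log10(block_params),    # block size, log scale
            np.log10(model_params)]    # model size, log scale

# Offline: build a training set whose labels would, in practice, be the
# full-EWQ decisions (1 = safe to quantize) on a pool of reference
# models. Synthetic data keeps this sketch runnable end to end.
rng = np.random.default_rng(0)
X_train, y_train = [], []
for model_params in (1.6e9, 7e9, 70e9):
    n_blocks = int(rng.integers(16, 80))
    for i in range(n_blocks):
        X_train.append(block_features(i, n_blocks,
                                      model_params / n_blocks, model_params))
        y_train.append(int(i / n_blocks > 0.25))  # toy label, not real EWQ

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.array(X_train), np.array(y_train))

# Online: near-instant decisions for a new model from its config alone.
feats = [block_features(i, 32, 7e9 / 32, 7e9) for i in range(32)]
quantize_mask = clf.predict(np.array(feats))  # 1 where quantization is safe
```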