We present Entropy-Weighted Quantization (EWQ), a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures, ranging from 1.6B to 70B parameters, and show consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding is that EWQ can reduce perplexity relative to unquantized models, suggesting a beneficial regularization effect from selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need to load model weights. FastEWQ exploits universal characteristics of entropy distributions that persist across architectures and scales, enabling near-instantaneous quantization decisions while maintaining 80% classification agreement with full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
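As a schematic illustration of the core selection rule, the sketch below ranks transformer blocks by the entropy of their weight distributions and marks the lowest-entropy fraction for quantization. The histogram-based entropy estimator, the low-entropy-first ordering, and the fixed coverage fraction are simplifying assumptions for illustration, not the full EWQ criterion.

```python
import numpy as np

def block_entropy(weights: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (in bits) of a block's weight values,
    estimated from a histogram of the flattened tensor."""
    hist, _ = np.histogram(weights.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

def select_blocks(blocks: dict, coverage: float = 0.5) -> set:
    """Mark the lowest-entropy fraction of transformer blocks as
    safe to quantize. `blocks` maps block names to weight tensors;
    the `coverage` knob is an illustrative simplification."""
    ranked = sorted(blocks, key=lambda name: block_entropy(blocks[name]))
    return set(ranked[: int(len(ranked) * coverage)])

# Example with random stand-in weights for an 8-block model.
rng = np.random.default_rng(0)
toy = {f"block_{i}": rng.normal(scale=1 + 0.2 * i, size=10_000)
       for i in range(8)}
to_quantize = select_blocks(toy)  # names of blocks to quantize
```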
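The following sketch illustrates the FastEWQ idea of replacing full entropy analysis with a lightweight classifier over weight-free metadata. The feature set (relative block depth, block and model size, all derivable from a config file) and the random-forest choice are illustrative assumptions, and the training labels here are synthetic; in practice labels would come from running full EWQ analysis once on reference models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def block_features(block_index, n_blocks, block_params, model_params):
    """Features available without loading any weights; the exact
    feature set is an illustrative guess."""
    return [block_index / n_blocks,    # relative depth in the stack
            np.log10(block_params),    # block size, log scale
            np.log10(model_params)]    # model size, log scale

# Offline: build a training set whose labels would, in practice, be the
# full-EWQ decisions (1 = safe to quantize) on a pool of reference
# models. Synthetic data keeps this sketch runnable end to end.
rng = np.random.default_rng(0)
X_train, y_train = [], []
for model_params in (1.6e9, 7e9, 70e9):
    n_blocks = int(rng.integers(16, 80))
    for i in range(n_blocks):
        X_train.append(block_features(i, n_blocks,
                                      model_params / n_blocks, model_params))
        y_train.append(int(i / n_blocks > 0.25))  # toy label, not real EWQ

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.array(X_train), np.array(y_train))

# Online: near-instant decisions for a new model from its config alone.
feats = [block_features(i, 32, 7e9 / 32, 7e9) for i in range(32)]
quantize_mask = clf.predict(np.array(feats))  # 1 where quantization is safe
```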