We present a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs): Entropy-Weighted Quantization (EWQ). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures, from 1.6B to 70B parameters, showing consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding is that EWQ can reduce perplexity below that of the unquantized model, suggesting a beneficial regularization effect from selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need to load model weights. By exploiting universal characteristics of the entropy distribution that persist across architectures and scales, FastEWQ enables near-instantaneous quantization decisions while maintaining 80% classification accuracy relative to full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
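To make the core selection step concrete, the following is a minimal sketch, assuming entropy is computed over each block's empirical weight histogram and that low-entropy blocks are the safer quantization candidates; the scoring rule, histogram range, and `threshold` value are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of entropy-based block selection (illustrative assumptions:
# entropy over a fixed-range weight histogram; low-entropy blocks quantized).
import numpy as np

def block_entropy(weights, num_bins=256):
    """Shannon entropy (bits) of a block's empirical weight distribution."""
    hist, _ = np.histogram(weights, bins=num_bins, range=(-1.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())

def select_blocks_to_quantize(blocks, threshold=6.0):
    """Flag blocks whose weight entropy falls below `threshold` (a
    hypothetical cutoff) for low-precision quantization; the rest stay
    at full precision."""
    return [name for name, w in blocks.items()
            if block_entropy(w) < threshold]

# Toy usage with two synthetic "transformer blocks":
rng = np.random.default_rng(0)
blocks = {
    "block_0": rng.normal(0.0, 0.02, 10_000),  # tightly peaked -> low entropy
    "block_1": rng.normal(0.0, 0.30, 10_000),  # widely spread  -> high entropy
}
print(select_blocks_to_quantize(blocks))       # -> ['block_0']
```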
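FastEWQ's weight-free decision can be pictured as a lightweight classifier over block metadata. The sketch below is an assumption-laden illustration: the feature set (normalized block depth, log model size), the synthetic labels, and the logistic-regression choice are hypothetical stand-ins for whatever features and model the paper actually uses.

```python
# Hypothetical sketch of the FastEWQ idea: predict the quantize/keep decision
# for a block from metadata alone, with no weights loaded. Features, labels,
# and classifier are illustrative assumptions, not the paper's method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training set: [normalized block depth, log10(model parameters)].
# Labels stand in for decisions produced by full entropy analysis
# (1 = safe to quantize).
X = rng.uniform(size=(500, 2))
X[:, 1] = 9.0 + 2.0 * X[:, 1]        # log10(params) roughly in [9, 11]
y = (X[:, 0] > 0.3).astype(int)      # toy rule: later blocks quantize safely

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Near-instant decision for an unseen block, no weight tensors touched:
query = np.array([[0.8, 10.5]])      # block at 80% depth of a ~30B model
print(clf.predict(query))            # -> [1]: quantize
```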