Quantization is important for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from a performance drop due to its limited numerical representation ability. In contrast, mixed-precision quantization (MPQ) compresses models more effectively by allocating heterogeneous bit-widths across layers. MPQ is typically organized as a two-stage search-retraining process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unobserved and severe bit-width interference phenomenon among the highly coupled weights during optimization, which leads to considerable performance degradation under high compression ratios. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-width of each layer during training, so that the remaining bit-widths converge properly. Then, drawing inspiration from information theory, we present an information distortion mitigation technique that aligns the behavior of poorly performing bit-widths with that of well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the quality of candidate configurations without incurring any additional training cost. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at \href{https://github.com/1hunters/retraining-free-quantization}{https://github.com/1hunters/retraining-free-quantization}.
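The abstract leaves the training mechanics of the first stage unspecified, so the following is a minimal sketch only, not the authors' implementation. It assumes PyTorch, symmetric uniform weight quantization with a straight-through estimator, a loss-variance signal as the ``turbulence'' criterion for the bit-width scheduler, and a KL-divergence term against the full-precision path as a stand-in for the information distortion mitigation; the names \texttt{quantize}, \texttt{SharedQuantLinear}, and \texttt{BitWidthScheduler} are hypothetical.

\begin{verbatim}
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize(w, bits):
    # Symmetric uniform quantizer with a straight-through estimator:
    # the forward pass sees quantized weights, gradients flow to w.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (q - w).detach()

class SharedQuantLinear(nn.Linear):
    """One set of shared weights serving every candidate bit-width."""
    def forward(self, x, bits=32):
        w = self.weight if bits >= 32 else quantize(self.weight, bits)
        return F.linear(x, w, self.bias)

class BitWidthScheduler:
    """Freezes the most turbulent bit-width (largest recent loss
    variance, a plausible proxy) so the rest keep converging."""
    def __init__(self, candidates=(2, 3, 4, 8), window=100):
        self.history = {b: [] for b in candidates}
        self.window, self.frozen = window, set()

    def active(self):
        return [b for b in self.history if b not in self.frozen]

    def record(self, bits, loss):
        self.history[bits].append(loss)
        del self.history[bits][:-self.window]

    def maybe_freeze(self):
        if len(self.active()) <= 2:   # always keep some bit-widths live
            return
        var = {b: torch.tensor(self.history[b]).var().item()
               for b in self.active() if len(self.history[b]) > 1}
        if var:
            self.frozen.add(max(var, key=var.get))

# One coupled training step: sample an active bit-width, add a KL term
# aligning the low-bit output to the full-precision output.
layer = SharedQuantLinear(784, 10)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
sched = BitWidthScheduler()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
bits = random.choice(sched.active())
logits = layer(x, bits)
with torch.no_grad():
    teacher = layer(x, 32)            # the well-performing path
loss = F.cross_entropy(logits, y) + F.kl_div(
    F.log_softmax(logits, -1), F.softmax(teacher, -1),
    reduction="batchmean")
opt.zero_grad()
loss.backward()
opt.step()
sched.record(bits, loss.item())
# call sched.maybe_freeze() periodically, e.g. once per epoch
\end{verbatim}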
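The second, inference-only stage can likewise be pictured as a greedy descent over per-layer bit-widths: start every layer at the highest candidate and repeatedly take the single-layer reduction that best preserves a held-out score until a bit budget is met. This is a sketch under assumed interfaces, not the paper's exact scheme: \texttt{evaluate} is a hypothetical scorer (e.g., top-1 accuracy on a small calibration set) run under \texttt{torch.no\_grad}, so no training cost is incurred.

\begin{verbatim}
import torch

@torch.no_grad()
def greedy_search(model, layers, candidates, evaluate, target_avg_bits):
    """Inference-only greedy search over per-layer bit-widths.

    layers: ordered layer names; evaluate(model, config) scores the
    shared-weight model at a given {name: bits} configuration on
    held-out data -- no weights are ever updated.
    """
    config = {name: max(candidates) for name in layers}  # start high
    while sum(config.values()) / len(config) > target_avg_bits:
        best_score, best_move = -float("inf"), None
        for name in layers:
            lower = [b for b in candidates if b < config[name]]
            if not lower:
                continue
            trial = dict(config)
            trial[name] = max(lower)        # drop this layer one step
            score = evaluate(model, trial)  # pure inference
            if score > best_score:
                best_score, best_move = score, trial
        if best_move is None:
            break
        config = best_move
    return config

# Hypothetical usage (top1_on_calibration_set is assumed, not provided):
# cfg = greedy_search(model, ["conv1", "conv2", "fc"], (2, 3, 4, 8),
#                     evaluate=top1_on_calibration_set, target_avg_bits=3)
\end{verbatim}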