Quantization is important for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from a performance drop due to its limited numerical representation ability. In contrast, mixed-precision quantization (MPQ) compresses models more effectively by allocating heterogeneous bit-widths across layers. MPQ is typically organized as a two-stage search-retraining process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unobserved and severe bit-width interference phenomenon among the highly coupled weights during optimization, which leads to considerable performance degradation under high compression ratios. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-width of each layer during training, so that the remaining bit-widths converge properly. Then, drawing inspiration from information theory, we present an information distortion mitigation technique that aligns the behavior of poorly performing bit-widths with that of well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the quality of candidate configurations without incurring any additional training cost. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at \href{https://github.com/1hunters/retraining-free-quantization}{https://github.com/1hunters/retraining-free-quantization}.
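The abstract leaves the training mechanics of the first stage unspecified, so the following is a minimal sketch only, not the authors' implementation. It assumes PyTorch, symmetric uniform weight quantization with a straight-through estimator, a loss-variance signal as the ``turbulence'' criterion for the bit-width scheduler, and a KL-divergence term against the full-precision path as a stand-in for the information distortion mitigation; the names \texttt{quantize}, \texttt{SharedQuantLinear}, and \texttt{BitWidthScheduler} are hypothetical.

\begin{verbatim}
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize(w, bits):
    # Symmetric uniform quantizer with a straight-through estimator:
    # the forward pass sees quantized weights, gradients flow to w.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (q - w).detach()

class SharedQuantLinear(nn.Linear):
    """One set of shared weights serving every candidate bit-width."""
    def forward(self, x, bits=32):
        w = self.weight if bits >= 32 else quantize(self.weight, bits)
        return F.linear(x, w, self.bias)

class BitWidthScheduler:
    """Freezes the most turbulent bit-width (largest recent loss
    variance, a plausible proxy) so the rest keep converging."""
    def __init__(self, candidates=(2, 3, 4, 8), window=100):
        self.history = {b: [] for b in candidates}
        self.window, self.frozen = window, set()

    def active(self):
        return [b for b in self.history if b not in self.frozen]

    def record(self, bits, loss):
        self.history[bits].append(loss)
        del self.history[bits][:-self.window]

    def maybe_freeze(self):
        if len(self.active()) <= 2:   # always keep some bit-widths live
            return
        var = {b: torch.tensor(self.history[b]).var().item()
               for b in self.active() if len(self.history[b]) > 1}
        if var:
            self.frozen.add(max(var, key=var.get))

# One coupled training step: sample an active bit-width, add a KL term
# aligning the low-bit output to the full-precision output.
layer = SharedQuantLinear(784, 10)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
sched = BitWidthScheduler()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
bits = random.choice(sched.active())
logits = layer(x, bits)
with torch.no_grad():
    teacher = layer(x, 32)            # the well-performing path
loss = F.cross_entropy(logits, y) + F.kl_div(
    F.log_softmax(logits, -1), F.softmax(teacher, -1),
    reduction="batchmean")
opt.zero_grad()
loss.backward()
opt.step()
sched.record(bits, loss.item())
# call sched.maybe_freeze() periodically, e.g. once per epoch
\end{verbatim}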
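The second, inference-only stage can likewise be pictured as a greedy descent over per-layer bit-widths: start every layer at the highest candidate and repeatedly take the single-layer reduction that best preserves a held-out score until a bit budget is met. This is a sketch under assumed interfaces, not the paper's exact scheme: \texttt{evaluate} is a hypothetical scorer (e.g., top-1 accuracy on a small calibration set) run under \texttt{torch.no\_grad}, so no training cost is incurred.

\begin{verbatim}
import torch

@torch.no_grad()
def greedy_search(model, layers, candidates, evaluate, target_avg_bits):
    """Inference-only greedy search over per-layer bit-widths.

    layers: ordered layer names; evaluate(model, config) scores the
    shared-weight model at a given {name: bits} configuration on
    held-out data -- no weights are ever updated.
    """
    config = {name: max(candidates) for name in layers}  # start high
    while sum(config.values()) / len(config) > target_avg_bits:
        best_score, best_move = -float("inf"), None
        for name in layers:
            lower = [b for b in candidates if b < config[name]]
            if not lower:
                continue
            trial = dict(config)
            trial[name] = max(lower)        # drop this layer one step
            score = evaluate(model, trial)  # pure inference
            if score > best_score:
                best_score, best_move = score, trial
        if best_move is None:
            break
        config = best_move
    return config

# Hypothetical usage (top1_on_calibration_set is assumed, not provided):
# cfg = greedy_search(model, ["conv1", "conv2", "fc"], (2, 3, 4, 8),
#                     evaluate=top1_on_calibration_set, target_avg_bits=3)
\end{verbatim}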