Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources. In this study, we investigate the rate-distortion tradeoff for NN model compression. First, we suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model, yielding a different rate at each layer, i.e., mixed-precision quantization. Then, we prove that our rotation-invariant approach is optimal in terms of compression. We rigorously evaluate RIQ and demonstrate its capabilities on various models and tasks. For example, RIQ achieves compression ratios of $\times 19.4$ and $\times 52.9$ on pre-trained dense and pruned VGG models, respectively, with $<0.4\%$ accuracy degradation. Code is available at \href{https://github.com/ehaleva/RIQ}{github.com/ehaleva/RIQ}.