Compression of deep neural networks has become a necessary stage in optimizing model inference on resource-constrained hardware. This paper presents FITCompress, a method that unifies layer-wise mixed-precision quantization and pruning under a single heuristic, as an alternative to neural architecture search and Bayesian-based techniques. FITCompress combines the Fisher Information Metric with path planning through compression space to pick optimal configurations under given size and operation constraints, using single-shot fine-tuning. Experiments on ImageNet validate the method and show that our approach yields a better trade-off between accuracy and efficiency than the baselines. Beyond computer vision benchmarks, we also experiment with the BERT model on a language understanding task, paving the way towards its optimal compression.
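To make the idea of choosing per-layer precision under a size constraint more concrete, the following is a minimal sketch and not the FITCompress algorithm itself. It assumes that a per-layer trace of the empirical Fisher information has already been estimated, models quantization error with the standard uniform-quantization noise approximation (step squared over 12), and substitutes a simple greedy descent for the path planning described above. Pruning is omitted. All names (`Layer`, `sensitivity`, `greedy_plan`) and the numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    num_params: int
    fisher_trace: float  # assumed precomputed trace of the empirical Fisher
    weight_range: float  # max(weights) - min(weights) for the layer


def quant_noise(weight_range: float, bits: int) -> float:
    """Expected squared error per weight for uniform quantization (step^2 / 12)."""
    step = weight_range / (2 ** bits - 1)
    return step * step / 12.0


def sensitivity(layer: Layer, bits: int) -> float:
    """Illustrative Fisher-weighted perturbation score for one layer."""
    return layer.fisher_trace * quant_noise(layer.weight_range, bits)


def model_bits(layers, config) -> int:
    """Total storage in bits for a given per-layer bit-width assignment."""
    return sum(l.num_params * config[l.name] for l in layers)


def greedy_plan(layers, budget_bits: int, start_bits: int = 8, min_bits: int = 2):
    """Lower one layer's bit-width at a time until the size budget is met.

    At each step, pick the layer whose one-bit reduction adds the least
    sensitivity per bit of storage saved. This greedy rule is an assumption
    standing in for a proper search through compression space.
    """
    config = {l.name: start_bits for l in layers}
    while model_bits(layers, config) > budget_bits:
        best = None
        for l in layers:
            b = config[l.name]
            if b <= min_bits:
                continue  # this layer cannot be compressed further
            added = sensitivity(l, b - 1) - sensitivity(l, b)
            ratio = added / l.num_params  # dropping one bit saves num_params bits
            if best is None or ratio < best[0]:
                best = (ratio, l.name)
        if best is None:
            raise ValueError("size budget cannot be reached with min_bits")
        config[best[1]] -= 1
    return config


# Hypothetical two-layer model: a small, sensitive layer and a large, robust one.
layers = [
    Layer("conv1", num_params=1_000, fisher_trace=50.0, weight_range=2.0),
    Layer("fc", num_params=100_000, fisher_trace=1.0, weight_range=2.0),
]
budget = 404_000  # roughly half of the 8-bit model size, in bits
plan = greedy_plan(layers, budget)
print(plan, model_bits(layers, plan))
```

In this toy setting the large, low-sensitivity layer absorbs the bit-width reductions, while the small, high-sensitivity layer keeps its precision. A sensitivity-aware heuristic is intended to produce this kind of allocation.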