For effective and efficient deep neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks, those requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful simplification technique, and it is generally desirable to quantize as aggressively as possible without incurring significant accuracy degradation. Because each layer of a network may have a different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to minimize the drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL), which is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS), which is straightforward and relies on single-epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for the ResNet-50 and ResNet-101 classification networks. Compared with leading mixed precision layer selection techniques, the two methods demonstrate improved performance across the entire accuracy-throughput frontier for these networks and, in our own commensurate comparison, equivalent performance for the PSPNet segmentation network, while requiring orders of magnitude less compute time to reach a solution.
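As a rough illustration of how the entropy of a layer's weight distribution could guide precision selection, the sketch below bins each layer's weights into uniform levels, computes the Shannon entropy of the resulting histogram, and assigns the lower precision to the layers with the lowest entropy. This is not the EAGL procedure itself, whose details are not given here: the uniform binning, the ranking rule (low entropy taken as a proxy for low sensitivity), and the names `weight_entropy`, `select_low_precision_layers`, and `num_low` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def weight_entropy(weights, bits=4):
    """Shannon entropy (in bits) of weights binned into 2**bits uniform levels."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0  # a constant layer carries no information
    levels = 2 ** bits
    scale = (levels - 1) / (hi - lo)
    # Map each weight to the index of its nearest uniform level.
    counts = Counter(round((w - lo) * scale) for w in weights)
    n = len(weights)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_low_precision_layers(layers, num_low):
    """Return the names of the `num_low` layers with the lowest entropy.

    Assumption (not taken from the source): low entropy is treated as a
    proxy for low sensitivity to quantization, so these layers are the
    first candidates for the lower precision (e.g., 2-bit instead of 4-bit).
    """
    ranked = sorted(layers, key=lambda name: weight_entropy(layers[name]))
    return ranked[:num_low]


# Toy "network": layer names and weight values are made up for the example.
rng = random.Random(0)
layers = {
    "conv1": [rng.gauss(0.0, 1.0) for _ in range(4096)],      # broad distribution
    "conv2": [rng.choice([-1.0, 1.0]) for _ in range(4096)],  # only two values
    "fc": [0.5] * 4096,                                       # constant weights
}
chosen = select_low_precision_layers(layers, num_low=2)
print(chosen)
```

In this toy setup the constant and two-valued layers are selected for the lower precision, while the broadly distributed layer keeps the higher one. A practical selector would more likely trade entropy against per-layer compute cost under a throughput budget rather than use a fixed count.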