Quantization is commonly used to compress and accelerate deep neural networks. Assigning the same bit-width to all layers causes large accuracy degradation at low precision and wastes resources at high precision. Mixed-precision quantization (MPQ) assigns different bit-widths to different layers to optimize the accuracy-efficiency trade-off. Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently. We show that this assumption does not reflect the true behavior of quantized deep neural networks. We propose the first MPQ algorithm that captures the cross-layer dependency of quantization error. Our algorithm (CLADO) enables a fast approximation of pairwise cross-layer error terms by solving linear equations that require only forward evaluations of the network on a small amount of data. Layerwise bit-widths are then assigned by optimizing a new MPQ formulation that depends on these cross-layer quantization errors; the formulation is an Integer Quadratic Program (IQP) that can be solved within seconds. We conduct experiments on multiple networks on the ImageNet dataset and demonstrate improvements in top-1 classification accuracy of up to 27% over uniform-precision quantization and up to 15% over existing MPQ methods.
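To make the idea of pairwise cross-layer terms concrete, here is a minimal toy sketch in Python. It is not the CLADO implementation: the synthetic `loss_increase` function, the `COUPLING` matrix, the layer `SIZES`, and the brute-force search (used here in place of an IQP solver) are all hypothetical stand-ins chosen for illustration. The sketch measures per-layer and pairwise loss changes from "forward evaluations" alone, then picks the bit-width assignment that minimizes the resulting quadratic objective under a size budget.

```python
import itertools

# All constants below are hypothetical and exist only for illustration.
BITS = (2, 4, 8)            # candidate bit-widths per layer
SIZES = (4.0, 2.0, 1.0)     # relative parameter counts of three toy layers
N = len(SIZES)
# Symmetric coupling between layers; off-diagonal entries model cross-layer dependency.
COUPLING = [[1.0, 0.8, 0.0],
            [0.8, 1.0, 0.3],
            [0.0, 0.3, 1.0]]


def loss_increase(assignment):
    """Toy stand-in for a forward evaluation of a partially quantized network.

    `assignment` maps layer index -> bit-width; layers not listed stay full precision.
    """
    err = [2.0 ** (-assignment[i]) if i in assignment else 0.0 for i in range(N)]
    return sum(COUPLING[i][j] * err[i] * err[j] for i in range(N) for j in range(N))


def measure_terms():
    """Estimate single-layer and pairwise cross-layer terms from evaluations only."""
    single = {(i, b): loss_increase({i: b}) for i in range(N) for b in BITS}
    cross = {}
    for i, j in itertools.combinations(range(N), 2):
        for bi in BITS:
            for bj in BITS:
                both = loss_increase({i: bi, j: bj})
                # Whatever the two single-layer terms fail to explain is the cross term.
                cross[(i, bi, j, bj)] = both - single[(i, bi)] - single[(j, bj)]
    return single, cross


def predicted_increase(choice, single, cross):
    """Quadratic objective: per-layer terms plus pairwise cross-layer terms."""
    total = sum(single[(i, b)] for i, b in enumerate(choice))
    total += sum(cross[(i, choice[i], j, choice[j])]
                 for i, j in itertools.combinations(range(N), 2))
    return total


def model_size(choice):
    return sum(s * b for s, b in zip(SIZES, choice))


def solve(budget, single, cross):
    """Brute-force search over assignments; a real IQP solver would replace this."""
    feasible = [c for c in itertools.product(BITS, repeat=N) if model_size(c) <= budget]
    return min(feasible, key=lambda c: predicted_increase(c, single, cross))


single, cross = measure_terms()
best = solve(28.0, single, cross)   # 28.0 matches the size of uniform 4-bit here
print("chosen bit-widths:", best, "size:", model_size(best))
```

Because the toy loss is exactly quadratic, the pairwise decomposition reproduces the true loss increase; on a real network it would only be an approximation, and the enumeration would not scale beyond a handful of layers.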