Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce the compute and memory demands of DNN inference, which is particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or extensive searches for the best hyper-parameters. However, these expensive operations are impractical on devices with limited computation capabilities, memory capacities, and energy budgets. Furthermore, many existing methods do not focus on sub-6-bit (or deep) quantization. To fill these gaps, in this paper we propose DQA (Deep Quantization of DNN Activations), a new method that targets sub-6-bit quantization of activations and leverages simple shift-based operations and Huffman coding to remain efficient while achieving high accuracy. We evaluate DQA at 3-, 4-, and 5-bit quantization levels with three DNN models on two tasks, image classification and image segmentation, across two datasets. DQA achieves significantly better accuracy (up to 29.28% higher) than direct quantization and the state-of-the-art NoisyQuant at sub-6-bit levels.
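The abstract does not spell out DQA's algorithm, but the shift-based idea it refers to can be illustrated generically: if the quantization scale is constrained to a power of two, rescaling activations reduces to bit shifts on integer hardware instead of multiplications. The sketch below is a minimal, hypothetical illustration of such power-of-two (shift-based) activation quantization, not the DQA method itself; the function names and the signed-range convention are assumptions. The resulting low-bit integer codes are exactly the kind of skewed-distribution data that a Huffman coder could then compress further.

```python
import math

def shift_quantize(acts, num_bits):
    """Quantize float activations to signed num_bits integers using a
    power-of-two scale, so rescaling is a bit shift on integer hardware.
    Generic sketch only -- not the DQA algorithm from the paper."""
    qmax = (1 << (num_bits - 1)) - 1  # e.g. 7 for signed 4-bit
    max_abs = max(abs(a) for a in acts)
    if max_abs == 0.0:
        return [0] * len(acts), 0
    # Largest shift s such that max_abs * 2**s still fits within qmax.
    shift = math.floor(math.log2(qmax / max_abs))
    scale = 2.0 ** shift
    q = [max(-qmax, min(qmax, round(a * scale))) for a in acts]
    return q, shift

def shift_dequantize(q, shift):
    """Invert the power-of-two scaling (a right shift in integer form)."""
    return [v / (2.0 ** shift) for v in q]

acts = [0.1, 0.5, 1.2, 2.7]
q, s = shift_quantize(acts, num_bits=4)   # q = [0, 1, 2, 5], s = 1
recon = shift_dequantize(q, s)            # within half a step of acts
```

Because the scale is 2**s rather than an arbitrary float, an integer-only inference engine can apply it with a single shift instruction, which is what makes this family of schemes cheap on constrained devices.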