Quantization to low-bitwidth numbers is actively researched as a way to accelerate the inference of deep neural networks (DNNs). A prominent challenge is to quantize DNN models without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work proposes DyBit, an adaptive data representation with variable-length encoding. DyBit can dynamically adjust the precision and range of its separate bit-fields to fit the distribution of DNN weights and activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup. Experimental results demonstrate that the inference accuracy with DyBit is 1.997% higher than the state of the art at 4-bit quantization, and that the proposed framework achieves up to 8.1x speedup compared with the original model.
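To make the low-bitwidth challenge concrete, the following is a minimal sketch of plain symmetric uniform quantization applied to synthetic Gaussian "weights" at 8 and 4 bits. It is not the DyBit format, whose encoding is not specified here. The helper names `quantize_uniform` and `mse`, the random seed, and the Gaussian weight distribution are all assumptions chosen for illustration.

```python
import random


def quantize_uniform(values, bits):
    """Symmetric uniform quantization onto a signed `bits`-bit grid.

    Illustrative baseline only (hypothetical helper, not DyBit).
    """
    qmax = 2 ** (bits - 1) - 1               # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = []
    for v in values:
        q = round(v / scale)                 # nearest integer level
        q = max(-qmax, min(qmax, q))         # clamp to the representable range
        quantized.append(q * scale)          # map back to a real value
    return quantized


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Assumed stand-in for a layer's weights: zero-mean, unit-variance Gaussian.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

err8 = mse(weights, quantize_uniform(weights, 8))
err4 = mse(weights, quantize_uniform(weights, 4))
print(f"8-bit MSE: {err8:.6f}")
print(f"4-bit MSE: {err4:.6f}")
```

Because the uniform grid spends its few 4-bit levels evenly across the whole range, most levels fall where Gaussian-like weights are sparse, and the error rises sharply relative to 8 bits. A representation that adapts its precision and range to the distribution aims to reduce that gap.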