Quantization is commonly used in Deep Neural Networks (DNNs) to reduce storage and computational complexity by lowering the arithmetic precision of activations and weights, collectively referred to as tensors. Efficient hardware architectures employ linear quantization to enable the deployment of recent DNNs on embedded systems and mobile devices. However, linear uniform quantization usually cannot reduce the numerical precision below 8 bits without sacrificing model accuracy. This accuracy loss arises because tensors do not follow uniform distributions. In this paper, we show that a significant number of tensors fit an exponential distribution. We then propose DNA-TEQ, which exponentially quantizes DNN tensors with an adaptive scheme that achieves the best trade-off between numerical precision and accuracy loss. Experimental results show that DNA-TEQ provides a much lower quantization bit-width than previous proposals, resulting in an average compression ratio of 40% over the linear INT8 baseline, with negligible accuracy loss and without retraining the DNNs. In addition, DNA-TEQ leads the way in performing dot-product operations in the exponential domain, which saves 66% of energy consumption on average for a set of widely used DNNs.
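To make the contrast between linear and exponential quantization concrete, the following is a minimal sketch in Python with NumPy. It is not the DNA-TEQ algorithm: the abstract does not give its formulas, so the function names (`linear_quantize`, `exponential_quantize`), the fixed `base` parameter, and the choice to anchor the largest level at the tensor's maximum magnitude are all assumptions made for illustration. In particular, the adaptive search for the best parameters per tensor is not reproduced here.

```python
import numpy as np


def linear_quantize(x, bits):
    """Symmetric uniform quantization: levels are evenly spaced in value."""
    x = np.asarray(x, dtype=float)
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    scale = max_abs / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def exponential_quantize(x, bits, base=1.3):
    """Illustrative exponential quantization (assumed form, not DNA-TEQ itself).

    Each nonzero value is mapped to sign(x) * max_abs * base**k, where k is an
    integer in [-(levels - 1), 0]. One bit holds the sign and the remaining
    bits - 1 bits hold the exponent k, so levels are evenly spaced in the
    logarithmic domain rather than in value.
    """
    x = np.asarray(x, dtype=float)
    levels = 2 ** (bits - 1)
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    mag = np.abs(x)
    # Smallest representable magnitude; smaller nonzero inputs saturate to it.
    tiny = max_abs * base ** (-(levels - 1))
    k = np.round(np.log(np.maximum(mag, tiny) / max_abs) / np.log(base))
    k = np.clip(k, -(levels - 1), 0)
    # np.sign(0) is 0, so exact zeros stay zero.
    return np.sign(x) * max_abs * base ** k


def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Laplace samples stand in for a tensor with exponentially decaying magnitudes.
    tensor = rng.laplace(loc=0.0, scale=1.0, size=10_000)
    for bits in (4, 6, 8):
        lin_err = mse(tensor, linear_quantize(tensor, bits))
        exp_err = mse(tensor, exponential_quantize(tensor, bits))
        print(f"bits={bits}  linear MSE={lin_err:.6f}  exponential MSE={exp_err:.6f}")
```

Which scheme gives the lower error depends heavily on `base`, the bit-width, and the tensor's distribution, which is presumably why an adaptive, per-tensor choice of parameters matters. A representation of this kind also suggests why dot products can be cheaper in the exponential domain: the product of two quantized magnitudes, base**i and base**j, equals base**(i + j), so a multiplication reduces to an integer addition of exponents.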