Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware is still challenging because model quantization and conversion are complex. Existing model quantization frameworks such as TensorFlow QAT [1], TFLite PTQ [2], and Qualcomm AIMET [3] support only a limited set of quantization schemes (e.g., only asymmetric per-tensor quantization in TF1.x QAT [4]). Accordingly, deep learning models cannot be easily quantized for diverse fixed-point hardware, mainly because each target has slightly different quantization requirements. In this paper, we envision a new type of model quantization approach called MRQ (model re-quantization), which takes existing quantized models and quickly transforms them to meet different quantization requirements (e.g., asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much simpler than quantizing from scratch because it avoids costly re-training and provides support for multiple quantization schemes simultaneously. To minimize re-quantization error, we developed a new set of re-quantization algorithms, including weight correction and rounding error folding. We demonstrate that a MobileNetV2 QAT model [7] can be quickly re-quantized into two different quantization schemes (i.e., symmetric and symmetric+power-of-2 scale) with less than 0.64 units of accuracy loss. We believe our work is the first to leverage this concept of re-quantization for model quantization. Models obtained from the re-quantization process have been successfully deployed on the NNA in Echo Show devices.
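To make the two conversions named above concrete (asymmetric -> symmetric, and non-power-of-2 scale -> power-of-2 scale), the following is a minimal sketch of a naive re-quantization step. It is not the MRQ algorithm: it omits weight correction and rounding error folding, which the abstract names but does not specify. The function names, the 8-bit default, and the choice to round the scale up to the next power of two are all assumptions made for illustration.

```python
import math


def dequantize(q, scale, zero_point):
    """Map integer codes back to real values: x = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]


def requantize_symmetric(q, scale, zero_point, bits=8, power_of_2=False):
    """Naively re-quantize asymmetric codes to a symmetric scheme (zero point 0).

    Hypothetical helper for illustration only; it applies no error correction.
    """
    x = dequantize(q, scale, zero_point)
    qmax = 2 ** (bits - 1) - 1  # 127 for signed 8-bit
    max_abs = max(abs(v) for v in x)
    new_scale = max_abs / qmax if max_abs > 0 else 1.0
    if power_of_2:
        # Assumption: round the scale up to the next power of two so that
        # no value falls outside the representable range.
        new_scale = 2.0 ** math.ceil(math.log2(new_scale))
    # Round to the nearest integer code and clamp to the symmetric range.
    new_q = [max(-qmax, min(qmax, round(v / new_scale))) for v in x]
    return new_q, new_scale


# Example: uint8 codes with an asymmetric zero point of 128.
codes = [0, 64, 128, 255]
sym_q, sym_scale = requantize_symmetric(codes, scale=0.02, zero_point=128)
p2_q, p2_scale = requantize_symmetric(codes, scale=0.02, zero_point=128,
                                      power_of_2=True)
```

In this naive form, every value picks up a rounding error of up to half the new scale, and the power-of-2 variant uses a coarser scale than strictly necessary. Errors of this kind are what a re-quantization method has to keep small.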