On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Low-bit quantization has emerged as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization uses a mixture of bit-widths to unlock the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods rely on simulation on high-performance devices to search immense configuration spaces for a good accuracy-efficiency trade-off. This creates a non-negligible gap between the estimated efficiency metrics and the behavior of the actual hardware, which leaves quantized models far from the optimal accuracy and efficiency, and it makes the quantization process dependent on additional high-performance devices. In this paper, we propose an On-Chip Hardware-Aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization on the deployed edge device itself to achieve accurate and efficient computing. Specifically, for efficiency metrics, we build an On-Chip Quantization Aware pipeline, which lets the quantization process perceive the actual hardware efficiency of each quantized operator and avoids optimization errors caused by inaccurate simulation. For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique that effectively estimates the accuracy impact of operators in the on-chip scenario, removing the dependence of the quantization process on high computing power. By combining the information from the quantized model and the hardware through linear optimization, we obtain optimized bit-width configurations that perform well in both accuracy and efficiency. We evaluate inference accuracy and acceleration under quantization for various architectures and compression ratios on hardware. OHQ achieves 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and reduces latency by 15% to 30% compared to INT8 in real deployment.
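To make the final step concrete, the following is a minimal sketch of bit-width selection posed as a small integer linear program: choose one bit-width per layer so that the summed accuracy penalty is minimal while the summed latency stays within a budget. It is an illustration under stated assumptions, not the OHQ formulation itself. The function name `select_bit_widths`, the candidate bit-widths, the latency budget, and all sensitivity and latency numbers are hypothetical, and exhaustive enumeration stands in for a real ILP solver.

```python
from itertools import product

# Hypothetical per-layer measurements (illustrative numbers, not from the paper).
# sensitivity[i][b]: estimated accuracy penalty of quantizing layer i to b bits.
# latency[i][b]: latency of layer i at b bits, e.g. as measured on the target chip.
BITS = (4, 8)
sensitivity = [
    {4: 0.9, 8: 0.1},
    {4: 0.3, 8: 0.05},
    {4: 0.2, 8: 0.02},
    {4: 0.6, 8: 0.1},
]
latency = [
    {4: 1.0, 8: 1.6},
    {4: 2.0, 8: 3.0},
    {4: 1.5, 8: 2.5},
    {4: 1.0, 8: 1.4},
]


def select_bit_widths(sensitivity, latency, budget, bits=BITS):
    """Pick one bit-width per layer, minimizing total sensitivity under a latency budget.

    Objective and constraint are both linear in the one-hot choice variables,
    so this is a small integer linear program. Exhaustive search is used here
    only because the toy problem is tiny; it is an assumption of this sketch.
    """
    best_cost, best_cfg = float("inf"), None
    for cfg in product(bits, repeat=len(sensitivity)):
        total_latency = sum(latency[i][b] for i, b in enumerate(cfg))
        if total_latency > budget:
            continue  # violates the hardware constraint
        cost = sum(sensitivity[i][b] for i, b in enumerate(cfg))
        if cost < best_cost:
            best_cost, best_cfg = cost, cfg
    return best_cfg, best_cost


config, cost = select_bit_widths(sensitivity, latency, budget=7.0)
print(config, round(cost, 2))
```

With these toy numbers, the search keeps the first and last layers at 8 bits and drops the two middle layers to 4 bits, since those two layers are the least sensitive and the most expensive to keep at high precision. For realistic layer counts the enumeration grows exponentially, so a dedicated ILP solver would be the natural replacement.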
