Quantization has emerged as one of the most promising approaches for deploying advanced deep models on resource-constrained hardware. Mixed-precision quantization leverages multiple bit-width architectures to unleash the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization suffers from an exhaustive search space that causes immense computational overhead. The quantization process thus relies on separate high-performance devices rather than running locally, which also leads to a significant gap between the hardware metrics considered during the search and those of the real deployment. In this paper, we propose an On-chip Hardware-aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization without accessing online devices. First, we construct the On-chip Quantization Awareness (OQA) pipeline, which perceives the actual efficiency metrics of each quantization operator on the hardware. Second, we propose the Mask-guided Quantization Estimation (MQE) technique to efficiently estimate the accuracy metrics of operators under the constraints of on-chip-level computing power. By synthesizing network and hardware insights through linear programming, we obtain optimized bit-width configurations. Notably, the quantization process occurs entirely on-chip, without any additional computing devices or data access. We demonstrate accelerated inference after quantization for various architectures and compression ratios, achieving 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively. OHQ improves latency by 15% to 30% compared to INT8 in deployment.
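To make the bit-width selection step more concrete, the following is a minimal sketch of how per-layer accuracy and hardware metrics could be combined into a constrained selection problem. It is not the OHQ implementation: the function name `select_bit_widths`, the candidate bit-widths, the sensitivity and latency tables, and the latency budget are all hypothetical, and a brute-force enumeration stands in for the linear-programming solver so that the example needs only the Python standard library.

```python
from itertools import product

# Candidate bit-widths per layer (an assumption for this sketch).
BIT_WIDTHS = (4, 8)


def select_bit_widths(sensitivity, latency, latency_budget):
    """Pick one bit-width per layer, minimizing total sensitivity
    subject to a total latency budget.

    sensitivity[i][b]: hypothetical accuracy-loss score of layer i at b bits.
    latency[i][b]:     hypothetical measured latency of layer i at b bits.
    Returns (config, cost), or (None, inf) if no config fits the budget.
    """
    n_layers = len(sensitivity)
    best_cfg, best_cost = None, float("inf")
    # Exhaustive enumeration: only viable for tiny networks. A real
    # pipeline would hand the same objective and constraint to an
    # (integer) linear-programming solver instead.
    for cfg in product(BIT_WIDTHS, repeat=n_layers):
        total_latency = sum(latency[i][b] for i, b in enumerate(cfg))
        if total_latency > latency_budget:
            continue  # violates the hardware constraint
        cost = sum(sensitivity[i][b] for i, b in enumerate(cfg))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


# Made-up numbers for a three-layer network, for illustration only.
sensitivity = [{4: 0.9, 8: 0.1}, {4: 0.2, 8: 0.05}, {4: 0.5, 8: 0.1}]
latency = [{4: 1.0, 8: 2.0}, {4: 1.5, 8: 3.0}, {4: 1.0, 8: 2.0}]

config, cost = select_bit_widths(sensitivity, latency, latency_budget=5.5)
print(config, cost)
```

With these illustrative numbers, the layer that is least sensitive to quantization is the one pushed to 4 bits, while the other two layers stay at 8 bits within the latency budget. The enumeration grows exponentially with the number of layers, which is why a solver-based formulation is the practical choice for real networks.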