Deploying complex neural network models on resource-limited hardware is critical. This paper proposes a novel model quantization method, Low-Cost Proxy-Based Adaptive Mixed-Precision Model Quantization (LCPAQ), which contains three key modules. The hardware-aware module is designed around the limitations of the target hardware, while the adaptive mixed-precision quantization module evaluates quantization sensitivity using the Hessian matrix and Pareto frontier techniques. Integer linear programming is used to fine-tune the quantization across different layers. The low-cost proxy neural architecture search module then efficiently explores the ideal quantization hyperparameters. Experiments on ImageNet demonstrate that LCPAQ achieves quantization accuracy comparable or superior to that of existing mixed-precision models. Notably, LCPAQ requires only 1/200 of the search time of existing methods, offering a shortcut to practical quantization on resource-limited devices.
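To make the layer-wise bit-width allocation concrete, the following is a minimal sketch of the kind of constrained selection problem that integer linear programming solves in mixed-precision quantization: choose one bit-width per layer so that the total sensitivity is minimized under a model-size budget. It is not the LCPAQ implementation. The function name `allocate_bits`, the sensitivity values, the parameter counts, and the budget are all hypothetical, and an exhaustive search stands in for an ILP solver so that the example has no dependencies.

```python
from itertools import product


def allocate_bits(sensitivity, param_counts, budget_bits, choices=(4, 8)):
    """Pick one bit-width per layer, minimizing total sensitivity under a size budget.

    sensitivity[i][b]: hypothetical cost of quantizing layer i to b bits
                       (in practice this could come from a Hessian-based metric).
    param_counts[i]:   number of weights in layer i.
    budget_bits:       maximum total weight storage, in bits.
    Returns (assignment, cost); assignment is None if no choice fits the budget.
    """
    best_cost, best_assign = float("inf"), None
    # Exhaustive search over all assignments. This is only feasible for a few
    # layers; an ILP solver would replace this loop for real networks.
    for assign in product(choices, repeat=len(param_counts)):
        size = sum(n * b for n, b in zip(param_counts, assign))
        if size > budget_bits:
            continue  # violates the model-size constraint
        cost = sum(sensitivity[i][b] for i, b in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost


# Toy three-layer instance; every number here is made up for illustration.
sensitivity = [
    {4: 0.90, 8: 0.10},  # layer 0: very sensitive to low precision
    {4: 0.20, 8: 0.05},  # layer 1: large but tolerant
    {4: 0.50, 8: 0.02},  # layer 2: moderately sensitive
]
param_counts = [1000, 4000, 2000]
budget_bits = 40000

assignment, cost = allocate_bits(sensitivity, param_counts, budget_bits)
print(assignment, round(cost, 2))
```

In this toy instance the search keeps the first and third layers at 8 bits and drops the large, less sensitive second layer to 4 bits, which exactly meets the 40,000-bit budget. Hardware constraints such as latency or bit-operation limits could be added as further inequality checks alongside the size check.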