Mixed-precision quantization, in which the layers of a deep neural network are quantized to different precisions, makes it possible to optimize the trade-offs between model size, latency, and statistical accuracy beyond what homogeneous bit-width quantization can achieve. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization that finds mixed-precision configurations whose latency is optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. On the 1000-class ImageNet dataset, we reduce end-to-end latency by up to 28.6% compared to an 8-bit model, with a negligible accuracy drop relative to a full-precision baseline. We demonstrate speedups over the 8-bit baseline, again at a negligible accuracy drop, even on systems with no hardware support for sub-byte arithmetic. Furthermore, we show that our approach outperforms a differentiable search that targets reduced binary operation counts as a proxy for latency.
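To make the two hardware-agnostic quantities in the abstract concrete, the following minimal sketch computes model size and a binary operation (BOP) count for a per-layer precision assignment. It assumes the common convention that a layer's BOPs equal its multiply-accumulate count times its weight bit-width times its activation bit-width; the abstract does not state the exact definition used in the paper. The `Layer` class, the helper functions, and the three toy layers are hypothetical and chosen only for illustration; they are not MobileNet layers and not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """Hypothetical layer description used only for this illustration."""
    name: str
    params: int  # number of weights in the layer
    macs: int    # multiply-accumulate operations per inference


def model_size_bytes(layers, weight_bits):
    """Weight storage in bytes for a per-layer weight bit-width assignment."""
    total_bits = sum(layer.params * weight_bits[layer.name] for layer in layers)
    return total_bits / 8


def bops(layers, weight_bits, act_bits):
    """BOP count, assuming BOPs = MACs * weight bits * activation bits."""
    return sum(
        layer.macs * weight_bits[layer.name] * act_bits[layer.name]
        for layer in layers
    )


# Toy network with made-up sizes (assumption, not taken from the paper).
LAYERS = [
    Layer("a", params=1_000, macs=1_000_000),
    Layer("b", params=2_000, macs=4_000_000),
    Layer("c", params=4_000, macs=2_000_000),
]

# Homogeneous 8-bit configuration.
W_UNIFORM = {"a": 8, "b": 8, "c": 8}
A_UNIFORM = {"a": 8, "b": 8, "c": 8}

# One possible mixed-precision configuration.
W_MIXED = {"a": 8, "b": 4, "c": 2}
A_MIXED = {"a": 8, "b": 8, "c": 4}

if __name__ == "__main__":
    for label, w, a in [
        ("uniform 8-bit", W_UNIFORM, A_UNIFORM),
        ("mixed", W_MIXED, A_MIXED),
    ]:
        size = model_size_bytes(LAYERS, w)
        ops = bops(LAYERS, w, a)
        print(f"{label}: {size:.0f} bytes, {ops:,} BOPs")
```

Both quantities depend only on the network and the bit-width assignment, not on the target hardware. That is consistent with the abstract's point that a BOP count is only a proxy for latency: on a platform without sub-byte arithmetic support, a lower BOP count need not translate into a proportionally lower measured latency.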