Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed-precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance under the same resource constraint than networks with homogeneous bitwidths. However, finding the optimal bitwidth allocation is challenging because the search space grows exponentially with the number of layers in the network. In this paper, we propose QBitOpt, a novel algorithm for updating bitwidths during quantization-aware training (QAT). We formulate bitwidth allocation as a constrained optimization problem. By combining fast-to-compute sensitivities with efficient solvers during QAT, QBitOpt produces mixed-precision networks that achieve high task performance and are guaranteed to satisfy strict resource constraints. This contrasts with existing mixed-precision methods, which learn bitwidths using gradients and cannot provide such guarantees. We evaluate QBitOpt on ImageNet and show that it outperforms existing fixed-precision and mixed-precision methods under average bitwidth constraints commonly found in the literature.
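To make the constrained formulation concrete, the following is a minimal sketch of allocating bitwidths under a size-weighted average-bitwidth budget. It is not the QBitOpt algorithm: the abstract does not specify the sensitivity measure or the solver, so this sketch substitutes a simple greedy heuristic and assumes that the error a layer contributes at bitwidth b is proportional to its sensitivity times 4^(-b). The function name `allocate_bitwidths`, its parameters, and the toy inputs are all hypothetical.

```python
from typing import List, Sequence


def allocate_bitwidths(
    sensitivities: Sequence[float],
    sizes: Sequence[int],
    choices: Sequence[int] = (2, 4, 8),
    avg_budget: float = 4.0,
) -> List[int]:
    """Greedy bitwidth allocation under a size-weighted average-bitwidth budget.

    Illustrative only; this is a stand-in heuristic, not the paper's solver.
    """
    choices = sorted(choices)
    total = sum(sizes)
    budget_bits = avg_budget * total

    # Start every layer at the smallest bitwidth, which must itself be feasible.
    level = [0] * len(sizes)
    used = choices[0] * total
    if used > budget_bits:
        raise ValueError("budget is below the smallest available bitwidth")

    def err(i: int, bits: int) -> float:
        # ASSUMPTION: quantization error scales as sensitivity * 4^(-bits).
        return sensitivities[i] * 4.0 ** (-bits)

    while True:
        best, best_gain = None, 0.0
        for i in range(len(sizes)):
            if level[i] + 1 >= len(choices):
                continue  # layer is already at the highest bitwidth
            lo, hi = choices[level[i]], choices[level[i] + 1]
            extra = (hi - lo) * sizes[i]
            if used + extra > budget_bits:
                continue  # this upgrade would violate the constraint
            gain = (err(i, lo) - err(i, hi)) / extra  # error reduction per bit
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # no feasible upgrade remains
        used += (choices[level[best] + 1] - choices[level[best]]) * sizes[best]
        level[best] += 1

    return [choices[l] for l in level]


# Toy example: two small, sensitive layers and one large, insensitive layer.
bits = allocate_bitwidths([10.0, 1.0, 0.1], [100, 100, 800], avg_budget=4.0)
print(bits)
```

Because an upgrade is accepted only if it keeps the total bit count within the budget, every allocation the sketch returns satisfies the constraint by construction, which is the kind of guarantee the abstract contrasts with gradient-based bitwidth learning. On the toy inputs, the two small, sensitive layers receive 8 bits and the large, insensitive layer stays at 2 bits, for a weighted average of 3.2 bits. Exhaustive search over the same problem would need to examine 3^L allocations for L layers, which is the exponential growth the abstract refers to.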