Neural network quantization is frequently used to optimize model size, latency, and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for the entire network, meaning every layer is quantized to the same number of bits. However, in many networks some layers are significantly more robust to quantization noise than others, which leaves an important axis of improvement unused. Because many hardware solutions support multiple bit-width settings, mixed-precision quantization has emerged as a promising way to find a better performance-efficiency trade-off than homogeneous quantization. Most existing mixed-precision algorithms, however, are difficult for practitioners to use: they require access to the training data, have many hyper-parameters to tune, or even depend on end-to-end retraining of the entire model. In this work, we present a simple post-training mixed-precision algorithm that requires only a small unlabeled calibration dataset to automatically select a suitable bit-width for each layer, targeting desirable on-device performance. Our algorithm requires no hyper-parameter tuning, is robust to data variation, and takes practical hardware deployment constraints into account, making it a strong candidate for practical use. We experimentally validate the proposed method on several computer vision tasks, natural language processing tasks, and many different networks, and show that it finds mixed-precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
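The observation that some layers tolerate quantization noise better than others can be illustrated with a toy sketch. The abstract does not specify how sensitivity is measured or how bit-widths are chosen, so everything below is an assumption made for illustration: the signal-to-quantization-noise ratio (SQNR) as the sensitivity measure, symmetric per-tensor uniform quantization, the two synthetic "layers", and the hypothetical helper `assign_bits` that gives the lower bit-width to the most robust layers. It is not the algorithm proposed in this work.

```python
import numpy as np


def quantize_uniform(w, num_bits):
    """Symmetric per-tensor uniform quantization (quantize, then dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return w.copy()
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def sqnr_db(reference, noisy):
    """Signal-to-quantization-noise ratio in decibels (higher means more robust)."""
    signal = np.mean(reference ** 2)
    noise = np.mean((reference - noisy) ** 2)
    return float(10.0 * np.log10(signal / max(noise, 1e-20)))


def layer_sensitivity(layers, calib_x, bit_widths):
    """Output SQNR of each linear layer for each candidate bit-width.

    Only unlabeled calibration inputs are needed; no labels, no retraining.
    """
    result = {}
    for name, w in layers.items():
        reference = calib_x @ w
        result[name] = {
            bits: sqnr_db(reference, calib_x @ quantize_uniform(w, bits))
            for bits in bit_widths
        }
    return result


def assign_bits(sensitivity, low_bits, high_bits, num_low):
    """Hypothetical rule: the num_low most robust layers get low_bits."""
    ranked = sorted(sensitivity, key=lambda n: sensitivity[n][low_bits], reverse=True)
    return {n: (low_bits if i < num_low else high_bits) for i, n in enumerate(ranked)}


# Toy data (assumed): one well-behaved layer, one layer with a weight outlier.
rng = np.random.default_rng(0)
calib_x = rng.standard_normal((64, 16))   # small unlabeled calibration batch
gaussian_w = rng.standard_normal((16, 16))
outlier_w = rng.standard_normal((16, 16))
outlier_w[0, 0] = 20.0                    # a single outlier stretches the quantization grid

layers = {"gaussian": gaussian_w, "outlier": outlier_w}
sens = layer_sensitivity(layers, calib_x, bit_widths=(4, 8))
bits = assign_bits(sens, low_bits=4, high_bits=8, num_low=1)

for name in layers:
    print(name, {b: round(s, 1) for b, s in sens[name].items()}, "->", bits[name], "bits")
```

In this toy setup the layer with the weight outlier loses far more SQNR at 4 bits than the well-behaved layer, so a mixed assignment keeps it at 8 bits while the other layer drops to 4. A practical method would additionally have to account for activation quantization, interactions between layers, and the bit-width combinations the target hardware actually supports.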