Inference time, model size, and accuracy are critical when deploying deep neural network models. Numerous research efforts have aimed to compress neural network models so that they achieve faster inference and higher accuracy. Pruning and quantization are the mainstream methods to this end. In model quantization, converting the individual floating-point values of layer weights to low-precision values can substantially reduce computational overhead and improve inference speed. Many quantization methods have been studied, including vector quantization, low-bit quantization, and binary/ternary quantization. This survey focuses on ternary quantization. We review the evolution of ternary quantization and investigate the relationships among existing ternary quantization methods from the perspectives of projection functions and optimization methods.
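To make the idea of a projection function concrete, the following is a minimal sketch of one threshold-based ternary projection, which maps each float weight to one of three values {-alpha, 0, +alpha}. It is an illustration only and is not taken from this survey: the function name `ternarize`, the parameter `delta_factor`, and the heuristic threshold of 0.7 times the mean absolute weight (a choice associated with Ternary Weight Networks) are assumptions made for the example.

```python
def ternarize(weights, delta_factor=0.7):
    """Project float weights onto {-alpha, 0, +alpha}.

    Hypothetical helper for illustration. The threshold heuristic
    (delta_factor * mean |w|) is an assumed choice, not the survey's method.
    Returns the ternary codes in {-1, 0, +1} and the shared scale alpha.
    """
    if not weights:
        return [], 0.0

    # Threshold: weights with magnitude at or below delta are set to zero.
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_factor * mean_abs

    # Scale: mean magnitude of the weights that survive the threshold.
    kept = [abs(w) for w in weights if abs(w) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0

    # Ternary codes: sign of the weight if it exceeds the threshold, else 0.
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    return codes, alpha


weights = [0.9, -0.05, 0.4, -0.7, 0.02, -0.3]
codes, alpha = ternarize(weights)
approx = [alpha * c for c in codes]  # low-precision approximation of weights
```

In this sketch each weight is stored as a 2-bit code plus one shared float scale per layer, which is where the savings in model size and computation would come from. Other projection functions and ways of choosing the threshold and scale are possible.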