Quantization for deep neural networks (DNNs) maps parameter values from their original data types to lower-precision data types to reduce model size and speed up inference. Because the range of the original values is larger than the range of the quantized values, quantization often maps multiple distinct original values to a single quantized value, which degrades the accuracy of the quantized DNNs. Outliers are a main cause of degraded quantization resolution because they enlarge the range of the original values. To address this, the percentile method is often used to clip outliers. However, clipping outliers introduces another problem: it removes important, strong signals in the DNNs. This paper proposes SplitQuant, which keeps the outliers while improving quantization resolution. SplitQuant narrows the range of the original values and mitigates the effect of outliers by splitting each quantizable layer into three mathematically equivalent layers and applying a different scaling factor to each. In particular, weights and biases are clustered into lower, middle, and upper clusters for an optimized split. By preprocessing DNNs with SplitQuant, quantization algorithms can achieve better results. Applied to two BERT-Tiny models, SplitQuant improved the accuracy of INT2 quantization by 3.3%p and 2.1%p, achieving accuracies comparable to those of the original FP32 models.
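To make the splitting idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It partitions a weight tensor into lower, middle, and upper parts that sum exactly back to the original (so three parallel layers remain mathematically equivalent to the one they replace), then quantizes each part with its own scale. Quantile thresholds stand in for the clustering step described in the abstract, and the bit width, threshold choices, and function names are all illustrative assumptions.

```python
import numpy as np

def quantize_dequantize(w, bits=2):
    # Symmetric uniform quantization: pick the scale so that the largest
    # magnitude in w maps to the top quantized level, round, then map back.
    levels = 2 ** (bits - 1) - 1  # e.g. 1 for symmetric INT2
    max_abs = np.abs(w).max()
    scale = max_abs / levels if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale

def split_weights(w, low_q=0.1, high_q=0.9):
    # Split values into lower / middle / upper clusters using quantile
    # thresholds (a simple stand-in for the clustering in the abstract;
    # low_q and high_q are illustrative, not the paper's settings).
    lo, hi = np.quantile(w, [low_q, high_q])
    lower  = np.where(w < lo, w, 0.0)
    middle = np.where((w >= lo) & (w <= hi), w, 0.0)
    upper  = np.where(w > hi, w, 0.0)
    # lower + middle + upper == w, so three layers built from these
    # tensors compute the same output as the original layer.
    return lower, middle, upper
```

Because each part spans a much narrower range than the full tensor, its quantization scale is finer: the outliers get their own coarse scales while the dense middle cluster keeps a small scale, instead of one outlier-dominated scale flattening all the small values to zero.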