The number of parameters in deep neural networks (DNNs) has grown explosively in recent years, as exemplified by large language models (LLMs), making inference on small-scale computers more difficult. Model compression is therefore essential for integrating such models into products. In this paper, we propose a method of quantization-aware training. We introduce a novel normalization (Layer-Batch Normalization) that is independent of the mini-batch size and incurs no additional computation cost during inference. We then quantize the weights using a scaled round-clip function combined with weight standardization. We also quantize the activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and quantized activation functions. We call this method Magic for the age of Quantised DNNs (MaQD). Experimental results show that our method quantizes models with minimal accuracy degradation.
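The abstract does not spell out the scaled round-clip function, so the following is only a minimal sketch of how such a quantizer could look, not the paper's implementation. It assumes the quantizer divides by a scale, rounds half up, clips to a symmetric integer range and rescales, and that the surrogate gradient is a straight-through estimator that passes gradients only inside the clip range. The names `standardize`, `scaled_round_clip`, `surrogate_grad`, `levels` and `scale` are hypothetical.

```python
import math
import statistics


def standardize(weights, eps=1e-5):
    """Weight standardization: shift to zero mean, scale to (roughly) unit variance."""
    mean = statistics.fmean(weights)
    var = statistics.pvariance(weights, mu=mean)
    return [(w - mean) / math.sqrt(var + eps) for w in weights]


def scaled_round_clip(x, levels=1, scale=1.0):
    """Assumed quantizer: round x / scale (half up), clip to [-levels, levels], rescale."""
    q = math.floor(x / scale + 0.5)   # round half up, avoids Python's banker's rounding
    q = max(-levels, min(levels, q))  # clip to the symmetric integer range
    return q * scale


def surrogate_grad(x, levels=1, scale=1.0):
    """Assumed straight-through surrogate: gradient 1 inside the clip range, 0 outside."""
    return 1.0 if abs(x) <= levels * scale else 0.0


# Usage: ternary quantization (levels=1) of a small, made-up weight vector.
weights = [0.2, -1.3, 0.7, 2.4]
standardized = standardize(weights)
quantized = [scaled_round_clip(w, levels=1) for w in standardized]
grads = [surrogate_grad(w, levels=1) for w in standardized]
```

With `levels=1` every quantized weight lands in {-1, 0, 1}. Standardizing first keeps the weight distribution at a known scale, so a fixed clip range covers most of it. In training, the forward pass would use `scaled_round_clip` while the backward pass substitutes `surrogate_grad` for the true derivative, which is zero almost everywhere.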