We present a post-training quantization algorithm with error estimates based on ideas from frame theory. Specifically, we use first-order Sigma-Delta ($\Sigma\Delta$) quantization for finite unit-norm tight frames to quantize the weight matrices and biases of a neural network. In this setting, we derive a bound on the error between the original and quantized neural networks in terms of the quantization step size and the number of frame elements. We also demonstrate how to leverage the redundancy of frames to obtain a quantized neural network with higher accuracy.
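To make the mechanism concrete, the following is a minimal sketch of first-order $\Sigma\Delta$ quantization of a single vector, not the algorithm as implemented in this work. It assumes the $N$-th roots-of-unity frame in $\mathbb{R}^2$ (a standard unit-norm tight frame), an unbounded uniform quantization grid $\delta\mathbb{Z}$, and linear reconstruction with the canonical dual frame. The function names `roots_of_unity_frame`, `sigma_delta_quantize`, and `quantize_vector` are hypothetical and chosen only for illustration.

```python
import numpy as np


def roots_of_unity_frame(N):
    """Return an (N, 2) array whose rows are the N-th roots of unity in R^2.

    For N >= 3 this is a unit-norm tight frame with frame bound N / 2.
    (Illustrative choice of frame; an assumption, not taken from the source.)
    """
    angles = 2.0 * np.pi * np.arange(N) / N
    return np.column_stack((np.cos(angles), np.sin(angles)))


def sigma_delta_quantize(coeffs, step):
    """First-order Sigma-Delta quantization of a sequence of frame coefficients.

    Recursion: q_n = Q(u_{n-1} + y_n), u_n = u_{n-1} + y_n - q_n, with u_0 = 0,
    where Q rounds to the nearest point of the grid step * Z (assumed unbounded).
    """
    q = np.empty_like(coeffs, dtype=float)
    u = 0.0  # internal state carrying the accumulated quantization error
    for n, y in enumerate(coeffs):
        q[n] = step * np.round((u + y) / step)
        u = u + y - q[n]
    return q


def quantize_vector(x, N, step):
    """Expand x in the frame, quantize the coefficients, reconstruct linearly."""
    F = roots_of_unity_frame(N)
    y = F @ x                          # frame coefficients <x, e_n>
    q = sigma_delta_quantize(y, step)  # quantized coefficients on the grid
    x_hat = (2.0 / N) * (F.T @ q)      # canonical dual of a tight frame: (d / N) F^T
    return q, x_hat


if __name__ == "__main__":
    x = np.array([0.3, -0.7])
    step = 0.25
    for N in (8, 64, 512):
        _, x_hat = quantize_vector(x, N, step)
        print(N, np.linalg.norm(x - x_hat))
```

In this toy setting the state satisfies $|u_n| \le \delta/2$, and summation by parts gives a reconstruction error of at most $\delta(2\pi + 1)/N$, so the error shrinks as the number of frame elements grows. This is the sense in which redundancy can be traded for accuracy at a fixed step size. Applying the same recursion to each column of a weight matrix is one plausible way to extend the sketch to a network layer, under the same assumptions.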