Deployment of deep neural networks (DNNs) has largely been confined to large hardware devices because of their high computational cost. This challenge has recently grown in scale with the emergence of large language models (LLMs). Quantization is a promising technique for reducing both memory footprint and latency. It converts floating-point representations to low bit-width fixed-point representations, usually by assuming a uniform mapping onto a regular grid. This process, referred to in the literature as uniform quantization, may however be ill-suited, as most DNN weights and activations follow a bell-shaped distribution. The mismatch is even worse for LLMs, whose weight distributions are known to exhibit large, high-impact outlier values. In this work, we propose an improvement over the most commonly adopted way to tackle this limitation in deep learning model quantization, namely non-uniform quantization. The proposed method, NUPES, leverages automorphisms to preserve scalar multiplications; these transformations are derived from power functions. However, optimizing the exponent parameter and the weight values remains a challenging and novel problem. It could not be solved with previous post-training optimization techniques, which only learn whether to round each weight value up or down in order to preserve the predictive function. We circumvent this limitation with a new paradigm: learning new quantized weights over the entire quantized space. Similarly, by alleviating the numerical instabilities, we enable the optimization of the power exponent, i.e., of the quantization operator itself, during training. The resulting predictive function is compatible with integer-only low-bit inference. We show that the method achieves state-of-the-art compression rates in both data-free and data-driven configurations.
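To make the contrast between uniform and power-based non-uniform quantization concrete, here is a minimal sketch. It is not the NUPES implementation: the function names (`power_map`, `quantize`, `mean_squared_error`, `make_weights`), the fixed exponent `a = 0.5`, the 4-bit width, and the synthetic weights (a Gaussian sample plus two outliers at ±10) are all assumptions chosen for illustration, and nothing here is learned. It only shows that the signed power function is multiplicative, and how warping values before rounding places more grid levels near zero.

```python
import math
import random


def power_map(x, a):
    """Signed power function sign(x) * |x|**a.

    It is multiplicative: power_map(x * y, a) == power_map(x, a) * power_map(y, a).
    """
    return math.copysign(abs(x) ** a, x)


def quantize(w, scale, bits, a=1.0):
    """Quantize w on a signed `bits`-bit grid after a power transform, then dequantize.

    With a == 1.0 this reduces to plain symmetric uniform quantization.
    """
    qmax = 2 ** (bits - 1) - 1
    t = power_map(w / scale, a)                   # warp the normalized value in [-1, 1]
    q = max(-qmax, min(qmax, round(t * qmax)))    # integer code on a regular grid
    return scale * power_map(q / qmax, 1.0 / a)   # inverse warp back to weight space


def mean_squared_error(weights, bits, a):
    """Mean squared quantization error, using the largest magnitude as the scale."""
    scale = max(abs(w) for w in weights)
    return sum((w - quantize(w, scale, bits, a)) ** 2 for w in weights) / len(weights)


def make_weights(n=5000, seed=0):
    """Synthetic bell-shaped weights plus two large outliers (illustrative only)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)] + [10.0, -10.0]


if __name__ == "__main__":
    weights = make_weights()
    print("uniform (a=1.0):", mean_squared_error(weights, bits=4, a=1.0))
    print("power   (a=0.5):", mean_squared_error(weights, bits=4, a=0.5))
```

With an exponent below 1, the dequantized levels cluster near zero, where most of the bell-shaped mass lies, while the outliers still fix the scale. In this toy setup the exponent is set by hand; the abstract describes optimizing it, together with the quantized weights, during training.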