Recent advances in large language model (LLM) pretraining have produced high-quality LLMs with impressive abilities. Compressing such LLMs via quantization to 3-4 bits per parameter lets them fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B parameter range, which are well suited for edge deployments. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique that enables, for the first time, near-lossless compression of LLMs across model scales while reaching compression levels similar to those of previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly large quantization errors, and storing them in higher precision while compressing all other weights to 3-4 bits. It achieves relative accuracy losses of less than 1% in perplexity for highly accurate LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation and with a 15% speedup, making powerful LLMs available to consumers without any downsides. SpQR comes with efficient algorithms for both encoding weights into its format and decoding them at runtime. Specifically, we provide an efficient GPU inference algorithm for SpQR that yields faster inference than 16-bit baselines at similar accuracy while enabling memory compression gains of more than 4x.
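To make the outlier-isolation idea concrete, the following is a minimal NumPy sketch, not the SpQR algorithm itself. It assumes a simple stand-in criterion (the largest-magnitude weights are treated as outliers) and plain min-max uniform quantization for the rest; the abstract does not specify how outliers are actually selected or how the dense weights are grouped. The names `quantize_with_outliers`, `dequantize`, and `outlier_fraction` are hypothetical and chosen for illustration only.

```python
import numpy as np


def quantize_with_outliers(weights, bits=3, outlier_fraction=0.01):
    """Toy sparse-plus-quantized split (illustrative, not the SpQR method)."""
    w = np.asarray(weights, dtype=np.float32)

    # Assumption: the largest-magnitude weights stand in for "outliers".
    k = max(1, int(round(outlier_fraction * w.size)))
    outlier_idx = np.argpartition(np.abs(w).ravel(), -k)[-k:]
    mask = np.zeros(w.size, dtype=bool)
    mask[outlier_idx] = True
    mask = mask.reshape(w.shape)

    # Uniform min-max quantization of the remaining (dense) weights.
    dense = w[~mask]
    lo, hi = dense.min(), dense.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else np.float32(1.0)
    codes = np.zeros(w.shape, dtype=np.uint8)
    codes[~mask] = np.round((dense - lo) / scale).astype(np.uint8)

    # Outliers are kept separately, in 16-bit floats, as a sparse list.
    outliers = {
        "indices": np.flatnonzero(mask),
        "values": w[mask].astype(np.float16),
    }
    return codes, scale, lo, outliers


def dequantize(codes, scale, zero, outliers):
    """Rebuild an approximate weight matrix from codes plus sparse outliers."""
    w_hat = codes.astype(np.float32) * scale + zero
    # Overwrite outlier positions with their higher-precision values.
    w_hat.flat[outliers["indices"]] = outliers["values"].astype(np.float32)
    return w_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
    w[3, 5], w[10, 20] = 1.5, -2.0  # inject two large outliers
    codes, scale, zero, outliers = quantize_with_outliers(w)
    w_hat = dequantize(codes, scale, zero, outliers)
    print("max abs error:", float(np.max(np.abs(w_hat - w))))
```

Because the two injected large values are stored separately, they no longer stretch the quantization range of the dense weights, so the 3-bit grid covers only the narrow bulk of the distribution. This is the intuition behind keeping outliers in higher precision; a real implementation would also need a compact sparse storage layout and a matching GPU decoding kernel.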