Low-rank decomposition of a matrix, that is, splitting a large matrix into a product of two smaller matrices, offers a means of compression that reduces the parameters of a model without sparsification and hence delivers more speedup on modern hardware. Moreover, unlike quantization, the compressed linear layers remain fully differentiable with all parameters trainable, and they can leverage existing, highly efficient kernels for floating-point matrices. We study the potential to compress Large Language Models (LLMs) for monolingual code generation via Low Rank Decomposition (LoRD) and observe that the ranks of the linear layers in these models can be reduced by up to 39.58% with less than a 1% increase in perplexity. We then use LoRD to compress StarCoder 16B to 13.2B parameters with no drop, and to 12.3B parameters with minimal drop, in HumanEval Pass@1 score, in less than 10 minutes on a single A100. The compressed models speed up inference by up to 22.35% with just a single line of code changed in Hugging Face's implementation with the PyTorch backend. LoRD models remain compatible with state-of-the-art near-lossless quantization methods such as SpQR, which allows further compression gains from quantization. Lastly, QLoRA over a LoRD model reduces memory requirements by as much as 21.2% relative to vanilla QLoRA while offering similar gains from parameter-efficient fine-tuning. Our work shows LoRD to be a promising new paradigm for LLM compression.
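The parameter arithmetic behind splitting one matrix into two smaller ones can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the layer shape (4096 x 1024) is an arbitrary example rather than a dimension taken from StarCoder. A dense layer of shape d_out x d_in stores d_out * d_in weights, whereas a rank-r factorization into a d_out x r matrix and an r x d_in matrix stores r * (d_out + d_in), so the factorization only saves parameters when r is below d_out * d_in / (d_out + d_in).

```python
# Hypothetical helpers illustrating parameter counts for a factorized linear layer.
# Biases are ignored; only the weight matrix is counted.

def dense_params(d_out, d_in):
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def low_rank_params(d_out, d_in, rank):
    """Number of weights in the two factors (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


def break_even_rank(d_out, d_in):
    """Rank at which the two factors hold exactly as many weights as the dense matrix."""
    return (d_out * d_in) / (d_out + d_in)


def param_reduction(d_out, d_in, rank):
    """Fraction of weights saved by the factorization (negative if it costs more)."""
    return 1.0 - low_rank_params(d_out, d_in, rank) / dense_params(d_out, d_in)


if __name__ == "__main__":
    d_out, d_in = 4096, 1024  # illustrative shape only, an assumption
    print(f"break-even rank: {break_even_rank(d_out, d_in):.1f}")
    for rank in (256, 512, 768, 1024):
        saved = param_reduction(d_out, d_in, rank)
        print(f"rank={rank:5d}  factor params={low_rank_params(d_out, d_in, rank):8d}  saved={saved:+.1%}")
```

Note that keeping the full rank of the matrix (1024 in this example) costs more parameters than the dense layer, which is why savings depend on how far the rank can be reduced relative to the layer's shape.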