Compressing Large Language Models using Low Rank and Low Precision Decomposition

The prohibitive sizes of today's Large Language Models (LLMs) make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$ -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further improves zero-shot performance. $\rm CALDERA$ obtains this decomposition by solving the optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable in low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of the target rank and the quantization bit budget. Results show that LlaMa-$2$ $7$B/$70$B and LlaMa-$3$ $8$B models compressed using $\rm CALDERA$ outperform existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: \href{https://github.com/pilancilab/caldera}{https://github.com/pilancilab/caldera}.
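To make the objective above concrete, the following is a minimal NumPy sketch of one way to approach $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$ by alternating between the quantized residual $\mathbf{Q}$ and the low-rank factors. It is not the authors' implementation (see the repository linked above for that). The function names, the simple min/max uniform quantizer, the bit widths, the damping term, and the alternating schedule are all assumptions made for illustration. The low-rank step uses the fact that, with $\mathbf{X}^\top\mathbf{X} = \mathbf{C}\mathbf{C}^\top$, the objective equals $\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{C}\rVert_{\rm F}^2$, so a truncated SVD of $(\mathbf{W} - \mathbf{Q})\mathbf{C}$ gives the best rank-$k$ fit before quantization.

```python
import numpy as np


def uniform_quantize(M, bits):
    """Round M onto a uniform grid spanning [M.min(), M.max()].

    Assumption: a plain round-to-nearest quantizer, used only for illustration.
    """
    levels = 2 ** bits - 1
    lo, hi = M.min(), M.max()
    if hi == lo:
        return M.copy()
    scale = (hi - lo) / levels
    return np.round((M - lo) / scale) * scale + lo


def weighted_error(W_hat, W, X):
    """Calibration-weighted objective ||(W_hat - W) X^T||_F^2."""
    return np.linalg.norm((W_hat - W) @ X.T, "fro") ** 2


def low_rank_fit(A, C, C_inv, rank):
    """Best rank-`rank` Z = L R minimizing ||(Z - A) C||_F (C invertible)."""
    U, s, Vt = np.linalg.svd(A @ C, full_matrices=False)
    L = U[:, :rank] * s[:rank]      # left factor, singular values absorbed
    R = Vt[:rank] @ C_inv           # undo the weighting on the right factor
    return L, R


def decompose(W, X, rank, q_bits=2, lr_bits=4, iters=10, damp=1e-6):
    """Alternate between Q and (L, R); keep the iterate with the lowest error."""
    n, d = W.shape
    H = X.T @ X + damp * np.eye(d)  # damping (assumed) keeps H positive definite
    C = np.linalg.cholesky(H)       # H = C C^T, C lower triangular
    C_inv = np.linalg.inv(C)
    L, R = np.zeros((n, rank)), np.zeros((rank, d))
    best = None
    for _ in range(iters):
        Q = uniform_quantize(W - L @ R, q_bits)        # quantize the residual
        L, R = low_rank_fit(W - Q, C, C_inv, rank)     # refit low-rank part
        L, R = uniform_quantize(L, lr_bits), uniform_quantize(R, lr_bits)
        err = weighted_error(Q + L @ R, W, X)
        if best is None or err < best[0]:
            best = (err, Q, L, R)
    return best[1], best[2], best[3]


def bits_per_param(n, d, rank, q_bits, lr_bits):
    """Average bits per original weight, ignoring scales and other metadata."""
    return (q_bits * n * d + lr_bits * rank * (n + d)) / (n * d)
```

Under these assumptions, calling `decompose(W, X, rank=4)` on a weight matrix `W` of shape `(n, d)` and calibration data `X` of shape `(m, d)` returns `Q`, `L`, `R` with `Q + L @ R` approximating `W` in the calibration-weighted norm. The helper `bits_per_param` illustrates the tradeoff between target rank and bit budget: for a 4096 x 4096 layer with a 2-bit `Q` and 4-bit rank-256 factors, it evaluates to 2.5 bits per parameter, not counting quantizer metadata.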

