The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further enhances the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results show that LlaMa-$2$ $7$B/$70$B and LlaMa-$3$ $8$B models compressed using $\rm CALDERA$ outperform existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: \href{https://github.com/pilancilab/caldera}{https://github.com/pilancilab/caldera}.
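To make the objective $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$ concrete, the following is a minimal NumPy sketch of one plausible way to search for such a decomposition. It is not the released implementation. Everything here is an assumption made for illustration: the alternating scheme (quantize the residual, then refit the low-rank factors), the simple uniform round-to-nearest quantizer, the damping term that keeps $\mathbf{X}^\top\mathbf{X}$ invertible, and all function and parameter names (`uniform_quantize`, `weighted_low_rank`, `decompose`, `damp`, and so on). The calibration matrix `X` is assumed to be nonzero, with one sample per row.

```python
import numpy as np


def uniform_quantize(A, bits):
    """Round A to the nearest point of a uniform grid with 2**bits levels
    spanning [A.min(), A.max()] (a stand-in for a real low-precision format)."""
    lo, hi = A.min(), A.max()
    if hi == lo:
        return A.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((A - lo) / scale) * scale + lo


def calib_error(W, W_hat, X):
    """Calibration-weighted squared error ||(W_hat - W) X^T||_F^2."""
    return np.linalg.norm((W_hat - W) @ X.T, "fro") ** 2


def weighted_low_rank(A, S, S_inv, rank):
    """Rank-constrained fit: minimize ||(L R - A) S||_F over rank-`rank` L R.
    Truncating the SVD of A S and mapping back through S^{-1} solves this
    exactly when S is invertible."""
    U, s, Vt = np.linalg.svd(A @ S, full_matrices=False)
    L = U[:, :rank] * s[:rank]
    R = Vt[:rank] @ S_inv
    return L, R


def decompose(W, X, rank, q_bits, lr_bits, iters=10, damp=1e-3):
    """Alternate between quantizing the residual and refitting quantized
    low-rank factors. Returns (error, Q, L, R) for the best iterate seen."""
    m, d = W.shape
    H = X.T @ X
    H = H + damp * np.trace(H) / d * np.eye(d)  # assumed damping for invertibility
    S = np.linalg.cholesky(H)                   # H = S S^T, so ||A X^T||_F ~ ||A S||_F
    S_inv = np.linalg.inv(S)
    L, R = np.zeros((m, rank)), np.zeros((rank, d))
    best = None
    for _ in range(iters):
        Q = uniform_quantize(W - L @ R, q_bits)          # backbone fits what L R misses
        L, R = weighted_low_rank(W - Q, S, S_inv, rank)  # factors fit what Q misses
        L, R = uniform_quantize(L, lr_bits), uniform_quantize(R, lr_bits)
        err = calib_error(W, Q + L @ R, X)
        if best is None or err < best[0]:
            best = (err, Q, L, R)
    return best
```

Calling `decompose(W, X, rank=4, q_bits=2, lr_bits=4)` on an `m × d` weight matrix and an `n × d` calibration matrix returns the lowest calibration error found together with the corresponding `Q`, `L`, and `R`. Because quantization is applied after each low-rank fit, the error is not guaranteed to decrease monotonically in this sketch, which is why the best iterate is tracked rather than the last one.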