Large Language Models (LLMs) have reshaped the landscape of artificial intelligence by demonstrating exceptional performance across a wide range of tasks. However, their substantial computational requirements make deployment challenging on resource-constrained devices. Recently, compression methods based on low-rank matrix techniques have shown promise, yet they often degrade accuracy or introduce significant overhead in parameters and inference latency. This paper introduces \textbf{Mo}dular \textbf{De}composition (MoDeGPT), a novel structured compression framework that resolves these drawbacks without requiring recovery fine-tuning. MoDeGPT partitions each Transformer block into modules composed of matrix pairs and reduces the hidden dimensions by reconstructing the module-level outputs. MoDeGPT is built on a theoretical framework that applies three well-established matrix decomposition algorithms -- Nystr\"om approximation, CR decomposition, and SVD -- to our redefined Transformer modules. Our comprehensive experiments show that MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods that rely on gradient information, while saving 98% of compute costs when compressing a 13B-parameter model. On \textsc{Llama}-2/3 and OPT models, MoDeGPT maintains 90-95% of zero-shot performance at 25-30% compression rates. Moreover, compression can be completed on a single GPU within a few hours and increases inference throughput by up to 46%.
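As a minimal illustration of the module-level idea (a sketch in our own notation, not the paper's exact algorithm): if a module's two weight matrices act as the composed linear map $W_{\text{down}} W_{\text{up}}$, with $W_{\text{up}} \in \mathbb{R}^{d_h \times d}$ and $W_{\text{down}} \in \mathbb{R}^{d \times d_h}$, then the shared hidden dimension can be reduced from $d_h$ to $k$ by a truncated SVD of the composition,
\begin{equation*}
W_{\text{down}} W_{\text{up}} \;\approx\; U_k \Sigma_k V_k^{\top} \;=\; \underbrace{U_k \Sigma_k}_{\widetilde{W}_{\text{down}}}\;\underbrace{V_k^{\top}}_{\widetilde{W}_{\text{up}}},
\end{equation*}
which, by the Eckart--Young theorem, preserves the module's output map with the smallest possible error for that rank. The full method applies this kind of module-level reconstruction across the redefined attention and MLP modules, using Nystr\"om approximation and CR decomposition alongside SVD as stated above.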