Because Large Language Models (LLMs) are so large, conventional compression methods cannot be applied to them directly. Even a minimal number of gradient updates is computationally demanding, particularly on consumer-grade hardware. This paper introduces an approach for the parametric and practical compression of LLMs based on reduced order modelling, which performs a low-rank decomposition in the feature space and a re-parameterization in the weight space. The technique operates layer by layer, which removes the need for a GPU and allows billion-scale models to be compressed under stringent memory and time constraints. By leveraging matrix decomposition, our method outperforms the prevailing state-of-the-art structured pruning method.
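To make the idea of "low-rank decomposition in the feature space and re-parameterization in the weight space" concrete, the following is a minimal NumPy sketch for a single linear layer. It is an illustration under stated assumptions, not the paper's implementation: the function name compress_linear_layer, the use of random calibration inputs, the chosen dimensions, and the choice of an eigendecomposition of the output-feature covariance are all assumptions made for this example. The layer's output features are projected onto their top principal directions, and that projection is then folded back into the weights so that one large matrix is replaced by two smaller ones. Everything runs on the CPU, one layer at a time.

```python
import numpy as np


def compress_linear_layer(weight, calib_inputs, rank):
    """Hypothetical helper: factorize one linear layer y = W x into two smaller maps.

    weight:       (d_out, d_in) weight matrix of the layer
    calib_inputs: (n, d_in) calibration activations fed to the layer
    rank:         number of feature-space directions to keep
    Returns (down, up) with shapes (rank, d_in) and (d_out, rank),
    so that up @ down approximates weight on the calibration data.
    """
    # Feature space: outputs of the layer on the calibration inputs.
    features = calib_inputs @ weight.T            # (n, d_out)

    # Low-rank decomposition of the features via their (uncentered) covariance.
    cov = features.T @ features                   # (d_out, d_out), symmetric
    _, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    basis = eigvecs[:, -rank:]                    # top-`rank` directions, (d_out, rank)

    # Weight-space re-parameterization: W is replaced by basis @ (basis.T @ W).
    down = basis.T @ weight                       # (rank, d_in)
    up = basis                                    # (d_out, rank)
    return down, up


# Toy example with assumed sizes; a real layer would be far larger.
rng = np.random.default_rng(0)
d_in, d_out, n, rank = 64, 48, 256, 16
weight = rng.standard_normal((d_out, d_in))
calib_inputs = rng.standard_normal((n, d_in))

down, up = compress_linear_layer(weight, calib_inputs, rank)

original = calib_inputs @ weight.T
approx = (calib_inputs @ down.T) @ up.T
rel_err = np.linalg.norm(original - approx) / np.linalg.norm(original)

params_before = weight.size
params_after = down.size + up.size
print(f"parameters: {params_before} -> {params_after}, relative error: {rel_err:.3f}")
```

In this sketch the parameter count drops from d_out * d_in to rank * (d_in + d_out), so the factorization only saves parameters when rank is smaller than d_out * d_in / (d_in + d_out). Random Gaussian weights have no low-rank structure, so the reconstruction error here is large; the example is meant to show the mechanics only.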