Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational cost, but applying a uniform compression ratio across all layers often causes significant performance degradation, and prior methods degrade noticeably during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to preserve text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, which achieves up to a 17% improvement in ROUGE-L on summarization tasks over state-of-the-art low-rank compression methods, establishing a more robust and efficient framework for LLM inference.
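To make the underlying idea concrete, the following is a minimal sketch of generic low-rank compression of a weight matrix via truncated SVD. It illustrates only the basic factorization and the resulting parameter savings; the function name and rank value are illustrative, and FLRC's per-layer rank allocation and progressive decoding are not shown.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Factor W (d_out x d_in) into A (d_out x r) @ B (r x d_in)
    using a rank-r truncated SVD (generic illustration, not FLRC itself)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
A, B = low_rank_factorize(W, rank=64)

# Parameters drop from d_out * d_in to r * (d_out + d_in):
# here 512*512 = 262144 becomes 64*(512+512) = 65536, a 4x reduction.
print(W.size, A.size + B.size)
```

Replacing a dense layer `y = W @ x` with `y = A @ (B @ x)` likewise cuts the multiply-accumulate count when `r * (d_out + d_in) < d_out * d_in`, which is why the rank chosen for each layer directly trades accuracy against memory and compute.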