Compressing LLMs: The Truth is Rarely Pure and Never Simple

Despite their remarkable achievements, modern Large Language Models (LLMs) incur exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit-width to 3 or 4 bits per weight, with negligible perplexity degradation over the uncompressed baseline. As recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on perplexity, a fairly simple and widely questioned metric (even for dense LLMs). We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which remain closely aligned with their dense counterparts, so that perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet pruned LLMs, even at $\geq 50$% sparsity, are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. All our related code is planned to be open-sourced.
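
To make the sparsity terminology above concrete, the following is a minimal sketch of unstructured pruning at a target sparsity ratio and of N:M structured sparsity (keep N weights in every group of M consecutive weights). It assumes plain magnitude-based selection, which is only the simplest baseline and not necessarily the criterion used by the SoTA methods evaluated here; the function names `magnitude_prune` and `nm_prune` are hypothetical and chosen for illustration.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of entries
    with the smallest absolute value (assumed magnitude criterion)."""
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))  # number of weights to drop
    if k == 0:
        return weights.copy()
    # Indices of the k smallest-magnitude entries.
    drop = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(flat.size, dtype=bool)
    mask[drop] = False
    return weights * mask.reshape(weights.shape)

def nm_prune(weights, n=2, m=4):
    """N:M sparsity: in every group of m consecutive weights along a row,
    keep the n largest-magnitude entries and zero the rest."""
    rows, cols = weights.shape
    assert cols % m == 0, "row length must be divisible by m"
    groups = weights.reshape(rows, cols // m, m)
    # Ascending sort by magnitude inside each group of m.
    order = np.argsort(np.abs(groups), axis=-1)
    mask = np.ones_like(groups, dtype=bool)
    # Mark the (m - n) smallest entries of each group for removal.
    np.put_along_axis(mask, order[..., : m - n], False, axis=-1)
    return (groups * mask).reshape(rows, cols)

# Toy weight matrix with distinct magnitudes.
w = np.arange(1, 17, dtype=float).reshape(2, 8)
pruned_50 = magnitude_prune(w, 0.5)   # 50% unstructured sparsity
pruned_2_4 = nm_prune(w, n=2, m=4)    # 2:4 structured sparsity
```

Both calls remove half of the weights, but the 2:4 pattern constrains where the zeros may fall, which is the more restrictive structured setting that the abstract reports as failing on knowledge-intensive tasks.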
