Despite their remarkable achievements, modern Large Language Models (LLMs) incur exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit width to 3 or 4 bits per weight with negligible perplexity degradation over the uncompressed baseline. As recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric: perplexity (even for dense LLMs). We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, whose perplexity aligns closely with that of their dense counterparts while failing to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs, even at $\geq 50$% sparsity, remain robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' abilities for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study fosters the development of better LLM compression methods. Our code is available at https://github.com/VITA-Group/llm-kick.