Despite their remarkable achievements, modern Large Language Models (LLMs) incur exorbitant computational and memory costs. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit-width to 3 or 4 bits per weight with negligible perplexity degradation relative to the uncompressed baseline. While recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on perplexity, a fairly simple metric that is widely questioned even for dense LLMs. We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks that redefines the evaluation protocol for compressed LLMs, which remain closely aligned with their dense counterparts while perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet pruned LLMs, even at $\geq 50$% sparsity, are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' abilities in language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. We plan to open-source all related code.
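To make the two sparsity regimes named above concrete, the following is a minimal sketch of unstructured pruning at a target sparsity ratio and of N:M pruning. It assumes a simple magnitude criterion, which is an illustrative choice and not necessarily one of the methods evaluated in this work. It also assumes the convention that N:M keeps the N largest-magnitude weights in every group of M consecutive weights. The function names are hypothetical.

```python
from typing import List


def prune_unstructured(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude fraction `sparsity` of all weights.

    Assumption: plain magnitude pruning, used here only for illustration.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices in ascending order of magnitude; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_prune])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def prune_n_m(weights: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude weights in each consecutive group of m.

    Assumption: "N:M" is read as n kept weights per group of m.
    """
    if not 0 <= n <= m:
        raise ValueError("n must satisfy 0 <= n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Local indices in descending order of magnitude; the first n survive.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        kept = set(order[:n])
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(group))
    return pruned


def sparsity_of(weights: List[float]) -> float:
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


if __name__ == "__main__":
    w = [0.9, 0.8, 0.7, 0.6, 0.1, 0.05, 0.02, 0.01]
    print(prune_unstructured(w, 0.5))  # global choice of which weights survive
    print(prune_n_m(w, 2, 4))          # choice constrained to each group of 4
```

Both calls in the usage example reach 50% sparsity, but the 2:4 pattern must discard two weights in every group of four even when all four are large. The per-group constraint is thus stricter than unstructured pruning at the same sparsity ratio.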