Compressing Large Language Models (LLMs) often degrades performance, especially on knowledge-intensive tasks. In this work, we examine how compression damages LLMs' inherent knowledge and what remedies are possible. We begin by proposing two conjectures about the nature of the damage: one holds that certain knowledge is forgotten (or erased) after compression, so the compressed model must (re)learn it from data using additional parameters; the other presumes that knowledge is internally displaced, so that mere "inference redirection" via input-side augmentation, such as prompting, suffices to recover knowledge-related performance. We then design extensive experiments to (in)validate the two conjectures. We observe that prompting is promising compared to model tuning, and we further unlock its potential by introducing a variant called Inference-time Dynamic Prompting (IDP), which effectively increases prompt diversity without incurring any inference overhead. Our experiments consistently show that, compared to classical re-training alternatives such as LoRA, prompting with IDP achieves better or comparable post-compression performance recovery while requiring 21x fewer extra parameters and reducing inference latency by 60%. These results strongly endorse the "knowledge displaced" conjecture over the "knowledge forgotten" one, and point to a new, efficient mechanism for restoring compressed LLM performance. We additionally visualize and analyze the differing attention and activation patterns of prompted versus re-trained models, demonstrating that they achieve performance recovery in two distinct regimes.