While the large number of parameters in Large Language Models (LLMs) contributes to their superior performance, this massive scale makes them inefficient and memory-hungry, and therefore hard to deploy on commodity hardware such as a single GPU. Given the memory and power constraints of such devices, model compression methods are widely employed to reduce both model size and inference latency, essentially trading model quality for improved efficiency. Optimizing this accuracy-efficiency trade-off is therefore crucial for LLM deployment on commodity hardware. In this paper, we introduce a new perspective for optimizing this trade-off: prompting compressed models. Specifically, we first observe that for certain questions, the generation quality of a compressed LLM can be significantly improved by adding carefully designed hard prompts, though this is not the case for all questions. Based on this observation, we propose a soft prompt learning method in which we expose the compressed model to the prompt learning process, with the aim of enhancing the performance of the prompts. Our experimental analysis suggests that our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model (with joint 4-bit quantization and 50% weight pruning), allowing it to match its uncompressed counterpart on popular benchmarks. We also demonstrate that these learned prompts can be transferred across various datasets, tasks, and compression levels. With this transferability, we can stitch the soft prompt to a newly compressed model to improve its test-time accuracy in an "in-situ" way.
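To make the "8x" figure concrete, the following minimal sketch shows one way the compression ratio could be derived from the two settings named in the abstract. It assumes a 16-bit baseline, which the abstract does not state, and a nominal count of 7 billion weights. It also ignores the storage overhead of sparse indices and quantization metadata. The function names are hypothetical and used only for illustration.

```python
def compression_ratio(baseline_bits: int, quant_bits: int, sparsity: float) -> float:
    """Ideal size reduction from quantizing weights and pruning a fraction of them.

    Assumption: every kept weight costs exactly `quant_bits` bits and pruned
    weights cost nothing (no sparse-index or scale/zero-point overhead).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    kept_fraction = 1.0 - sparsity
    return baseline_bits / (quant_bits * kept_fraction)


def weight_memory_gib(num_params: int, bits_per_weight: int, sparsity: float = 0.0) -> float:
    """Approximate weight storage in GiB under the same idealized assumptions."""
    kept_params = num_params * (1.0 - sparsity)
    return kept_params * bits_per_weight / 8 / 2**30


# Assumed values: nominal 7B weights, 16-bit baseline (not stated in the abstract).
NUM_PARAMS = 7_000_000_000
BASELINE_BITS = 16

# Settings taken from the abstract: 4-bit quantization, 50% weight pruning.
ratio = compression_ratio(BASELINE_BITS, quant_bits=4, sparsity=0.5)
dense_gib = weight_memory_gib(NUM_PARAMS, BASELINE_BITS)
compressed_gib = weight_memory_gib(NUM_PARAMS, 4, sparsity=0.5)

print(f"compression ratio: {ratio:.1f}x")
print(f"dense weights: {dense_gib:.2f} GiB, compressed weights: {compressed_gib:.2f} GiB")
```

Under these assumptions, 4-bit quantization contributes a factor of 4 and 50% pruning a factor of 2, which together give the 8x ratio. A real deployment would typically land somewhat below the ideal figure once index and metadata overhead are counted.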