Large language models have been shown to perform well on a variety of natural language processing tasks. However, as model size and input sequence length grow, the rapidly expanding KV cache significantly slows down inference. Consequently, GQA, as an alternative to MHA, has been widely adopted in LLMs. In this work, we propose a low-cost method for pruning MHA models into GQA models with any compression ratio of key-value heads. Our method is based on $\mathit{L_0}$ masks that gradually remove redundant parameters. In addition, before pruning training we apply orthogonal transformations to the attention heads, without changing the model's output, to increase the similarity between attention heads and thereby further improve performance. Our method is compatible with rotary position embedding (RoPE), which means the trained model can be fully adapted to the mainstream standard GQA framework. Experiments demonstrate that our strategy can compress up to 87.5% of the key-value heads of the LLaMA2-7B model without excessive performance degradation, achieved through supervised fine-tuning alone.
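The function-preserving property of the orthogonal transformation can be sketched as follows: attention logits depend only on the product $W_q W_k^\top$, so right-multiplying both per-head projections by the same orthogonal matrix $R$ leaves the logits unchanged while rotating the head's internal basis. The NumPy sketch below verifies this numerically; all names and dimensions are illustrative, and RoPE is ignored here for simplicity (under RoPE, only suitably block-structured rotations commute with the position encoding).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 4  # illustrative sizes, not LLaMA2's

# Hypothetical query/key projections for a single attention head.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
x = rng.normal(size=(seq_len, d_model))  # token representations

# Random orthogonal matrix R via QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))

# Attention logits before and after rotating the head subspace.
scores_before = (x @ W_q) @ (x @ W_k).T
scores_after = (x @ W_q @ R) @ (x @ W_k @ R).T

# Since R @ R.T = I, the logits agree up to floating-point error.
assert np.allclose(scores_before, scores_after)
```

This invariance is what allows attention heads to be rotated toward one another before pruning without altering the model's predictions.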