To enhance the efficiency of the attention mechanism within large language models (LLMs), previous works primarily compress the KV cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It is therefore intuitive to reduce this redundancy by sharing attention weights across layers. However, further analysis reveals two challenges: (1) directly sharing the weight matrix without carefully rearranging the attention heads proves ineffective; (2) shallow layers are vulnerable to even small deviations in attention weights. Driven by these insights, we introduce LISA, a lightweight substitute for self-attention in well-trained LLMs. LISA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate the differences in layer-wise attention weights. Evaluations on 13 typical benchmarks demonstrate that LISA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention computation in 53%-84% of the total layers. Our implementation of LISA achieves a 6x compression of the Q and K matrices within the attention mechanism, with maximum throughput improvements of 19.5%, 32.3%, and 40.1% for LLaMA3-8B, LLaMA2-7B, and LLaMA2-13B, respectively.
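To make the described mechanism concrete, the sketch below shows one plausible way a LISA-style layer could reuse the previous layer's attention weights: a tiny feed-forward network aligns the heads across adjacent layers, and low-rank Q/K factors approximate the layer-wise difference in attention logits. This is a minimal illustration under our own assumptions; the class name `LISAAttention`, the module layout, and all hyperparameters (`rank`, `ffn_hidden`) are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical sketch of a LISA-style attention substitute (not the authors' code).
import torch
import torch.nn as nn


class LISAAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, rank: int = 32, ffn_hidden: int = 64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Tiny FFN that mixes the previous layer's per-head attention logits
        # into this layer's heads (head alignment between adjacent layers).
        self.head_align = nn.Sequential(
            nn.Linear(n_heads, ffn_hidden),
            nn.ReLU(),
            nn.Linear(ffn_hidden, n_heads),
        )
        # Low-rank factors standing in for the full Q/K projections; they only
        # model the *difference* from the shared attention weights, which is
        # where the 6x Q/K compression would come from in this sketch.
        self.q_down = nn.Linear(d_model, rank, bias=False)
        self.q_up = nn.Linear(rank, d_model, bias=False)
        self.k_down = nn.Linear(d_model, rank, bias=False)
        self.k_up = nn.Linear(rank, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, prev_scores: torch.Tensor):
        # x:           (batch, seq, d_model) hidden states of this layer
        # prev_scores: (batch, n_heads, seq, seq) attention logits from the layer below
        b, s, _ = x.shape
        # 1) Head alignment: tiny FFN applied over the head dimension.
        aligned = self.head_align(prev_scores.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # 2) Low-rank correction approximating the layer-wise difference in Q/K.
        dq = self.q_up(self.q_down(x)).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        dk = self.k_up(self.k_down(x)).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        delta = dq @ dk.transpose(-2, -1) / self.d_head ** 0.5
        # 3) Reused-and-corrected attention weights replace a full QK^T pass.
        scores = aligned + delta
        attn = scores.softmax(dim=-1)
        v = self.v_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.o_proj(out), scores  # pass scores on to the next LISA layer
```

Under these assumptions, only the layers kept with full self-attention compute dense Q and K projections; LISA layers carry forward the previous layer's scores and apply the small alignment and low-rank modules, which is what would yield the reported compute and throughput savings.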