Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. However, deploying post-trained models is challenging due to their substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer and cross-layer KV sharing. These methods, however, still incur high inference-time overhead and remain suboptimal for post-training pre-trained LLMs. In this paper, we identify the \texttt{Softmax} operation as a primary bottleneck for LLM inference and find that it is highly redundant during post-training. We propose Softmax \textbf{Uni}fication in \textbf{Att}e\textbf{n}tion (\textbf{UniAttn}), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. In addition, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training.
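To make the unification idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: within an assumed group of consecutive transformer blocks, the Softmax attention probabilities are computed once in the first block and reused by the later blocks, each of which adds a learned linear compensation term. The class name \texttt{SharedSoftmaxAttention}, the grouping scheme, and the placement of the compensation projection are illustrative assumptions.

\begin{verbatim}
# Minimal sketch of Softmax unification (illustrative, not the paper's code):
# the first block of a group computes the attention probabilities (the
# Softmax output); later blocks reuse them, skipping QK^T and Softmax, and
# add a learned linear compensation term computed from the block input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSoftmaxAttention(nn.Module):  # name is an assumption
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        # Assumed form of the compensation: a linear map of the input,
        # added to the attention output to offset the unification error.
        self.compensate = nn.Linear(dim, dim, bias=False)

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, x, shared_probs=None):
        b, t, d = x.shape
        v = self._split(self.v_proj(x))
        if shared_probs is None:
            # First block of the group: compute Softmax as usual.
            q, k = self._split(self.q_proj(x)), self._split(self.k_proj(x))
            scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
            mask = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                         device=x.device), 1)
            probs = F.softmax(scores.masked_fill(mask, float("-inf")), -1)
        else:
            # Later blocks: reuse the cached probabilities.
            probs = shared_probs
        out = (probs @ v).transpose(1, 2).reshape(b, t, d)
        return self.o_proj(out) + self.compensate(x), probs

# Usage: the first block returns `probs`; later blocks in the group reuse it.
blocks = [SharedSoftmaxAttention(64, 4) for _ in range(3)]
x = torch.randn(2, 16, 64)
y, probs = blocks[0](x)                  # computes Softmax once
for blk in blocks[1:]:
    y, _ = blk(y, shared_probs=probs)    # reuses the Softmax activations
\end{verbatim}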