Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations: they either demand time-consuming recovery training that hinders real-world adoption, or rely on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel activation sparsification method designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more amenable to sparsification. By employing Top-K selection within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock speed-up. LaRoSA is effective across LLMs of various sizes and types, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock speed-up, and reduces the zero-shot accuracy gap to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
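To make the core idea concrete, the following is a minimal PyTorch-style sketch of rotated Top-K activation sparsification. It is an illustrative assumption, not the paper's implementation: the orthogonal matrix `Q` here is a random stand-in for the layerwise rotations computed by LaRoSA, and the function name `rotated_topk_sparsify` is hypothetical.

```python
import torch

def rotated_topk_sparsify(x, Q, k):
    """Hypothetical sketch: rotate activations with an orthogonal matrix Q,
    keep the k largest-magnitude entries per token, then rotate back.
    Q is assumed orthogonal (Q @ Q.T == I), so the rotation itself is lossless
    before sparsification; in practice Q could be folded into adjacent weights."""
    z = x @ Q  # rotate activations into a basis better suited to sparsification
    # Top-K selection by magnitude within the rotated activations
    idx = z.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)
    z_sparse = z * mask  # fixed per-token k gives consistent model-level sparsity
    return z_sparse @ Q.T  # rotate back

# Example: 40% sparsity on hidden size 4096 keeps the top 60% of channels per token
hidden = 4096
x = torch.randn(2, 8, hidden)  # (batch, seq, hidden)
Q, _ = torch.linalg.qr(torch.randn(hidden, hidden))  # random orthogonal stand-in
k = int(hidden * (1 - 0.40))
x_sparse = rotated_topk_sparsify(x, Q, k)
```

Because the number of retained channels is fixed by `k` rather than by a magnitude threshold, the sparsity level is deterministic per layer, which is what enables the consistent wall-clock speed-up described above.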