The high power consumption and latency sensitivity of large language model (LLM) deployments have motivated efficiency techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because permanently removing attention heads or neurons can significantly degrade accuracy. Prior work has modeled contextual sparsity with neural networks trained to predict activation magnitudes, which are then used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We develop a novel predictor called ShadowLLM, which shadows the LLM's behavior and enforces better sparsity patterns, yielding over a 15\% improvement in end-to-end accuracy without increasing latency compared to prior methods. ShadowLLM achieves up to a 20\% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on models with up to 30 billion parameters. Our code is available at \href{https://github.com/abdelfattah-lab/shadow_llm/}{ShadowLLM}.