Contextual sparsity is an approach for reducing the computational cost of inference in large language models (LLMs). Existing techniques that exploit contextual sparsity to accelerate LLM inference with minimal accuracy loss require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The proposed framework provides a fast, training-free method for building sparse pattern predictors using truncation-aware singular value decomposition (SVD) of the gate projection matrix, together with a threshold calibration algorithm, and supplies inference executors that support conditional computation on CUDA and CANN devices. Experiments on three sparse LLMs with an average FFN activation sparsity of 90% demonstrate up to a 1.8x speedup in end-to-end decoding while keeping benchmark score degradation below 1% on tasks involving complex math and code generation. This work advances the deployment of LLMs on edge devices.
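The core idea can be sketched as follows: a truncated SVD of the gate projection yields a cheap low-rank score that predicts which ReGLU channels will be active, so the exact gate, up, and down projections need only be computed on that predicted support. This is a minimal NumPy sketch under assumed shapes and names (`W_gate`, `W_up`, `W_down`, `rank`, `threshold` are all illustrative, not the paper's actual API); the paper's truncation-aware SVD and threshold calibration are replaced here by a plain truncated SVD and a fixed threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, rank = 64, 256, 16

# Hypothetical ReGLU FFN weights (illustrative shapes, not from the paper).
W_gate = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
W_down = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

# Training-free predictor: truncated SVD of the gate projection matrix.
# x @ A @ B approximates x @ W_gate at rank-r cost.
U, S, Vt = np.linalg.svd(W_gate, full_matrices=False)
A = U[:, :rank] * S[:rank]   # (d_model, rank)
B = Vt[:rank, :]             # (rank, d_ff)

def reglu_ffn_dense(x):
    """Reference dense ReGLU FFN: ReLU(x W_gate) * (x W_up), then W_down."""
    gate = np.maximum(x @ W_gate, 0.0)
    return (gate * (x @ W_up)) @ W_down

def reglu_ffn_sparse(x, threshold=0.0):
    """Compute the FFN only on channels the low-rank score predicts active."""
    score = (x @ A) @ B                # cheap approximation of x @ W_gate
    active = score > threshold         # predicted ReGLU support
    gate = np.maximum(x @ W_gate[:, active], 0.0)  # exact gate, active cols only
    up = x @ W_up[:, active]
    return (gate * up) @ W_down[active, :]

x = rng.standard_normal(d_model)
rel_err = np.linalg.norm(reglu_ffn_dense(x) - reglu_ffn_sparse(x)) \
    / np.linalg.norm(reglu_ffn_dense(x))
```

In a real executor the boolean mask would drive gathered matrix multiplies on the device (CUDA or CANN kernels) rather than NumPy fancy indexing; the threshold would be calibrated per layer to trade predicted sparsity against the relative error `rel_err`.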