With the rapid rise of large language models (LLMs), inference efficiency has become increasingly important. Various approximation methods have been proposed to reduce cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a high compression ratio with seemingly no quality degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds on prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observe that sparse models often share the full model's general problem-solving logic and require only a few token corrections to recover the original model's performance. This paper introduces Sirius, an efficient correction mechanism that significantly recovers the quality of CS models on reasoning tasks while maintaining their efficiency gains. Sirius is evaluated on 6 models across 8 difficult generation tasks in reasoning, math, and coding, and shows consistent effectiveness and efficiency. We also carefully develop a system implementation of Sirius and show that it achieves roughly a 20% latency reduction for an 8B model on-chip and a 35% reduction for a 70B model with offloading. We open-source our implementation of Sirius at https://github.com/Infini-AI-Lab/Sirius.git.