With the rapid rise of large language models (LLMs), inference efficiency has become increasingly important. Various approximation methods have been proposed to reduce cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a high compression ratio with seemingly no quality degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds on prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observe that sparse models often share the full model's general problem-solving logic and require only a few token corrections to recover the original model's performance. This paper introduces Sirius, an efficient correction mechanism that significantly recovers the quality of CS models on reasoning tasks while maintaining their efficiency gains. Sirius is evaluated on 6 models across 8 difficult generation tasks in reasoning, math, and coding, and shows consistent effectiveness and efficiency. We also carefully develop a system implementation of Sirius and show that it achieves roughly a 20% latency reduction for an 8B model on-chip and a 35% reduction for a 70B model with offloading. We open-source our implementation of Sirius at https://github.com/Infini-AI-Lab/Sirius.git.