Token filtering has been proposed to enhance the utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens is expected to reduce computational workloads, existing methods have not yet achieved a real-world efficiency boost. This is primarily due to two factors: (1) existing methods do not produce enough sparsity to yield a speedup, and (2) token filtering operates within a sparsity range that is non-standard in existing machine learning (ML) libraries and thus cannot be efficiently supported. This paper presents Centrifuge, a system that leverages algorithm and system co-design to unlock the full efficiency of token filtering in LLM training. At the algorithm level, Centrifuge filters activations of inconsequential tokens in the attention backward kernel to amplify the sparsity in backward computation. At the system level, Centrifuge proposes an automatic workflow that transforms sparse GEMM into dimension-reduced dense GEMM, so that standard ML libraries can execute it efficiently. Evaluations on models of various scales, from 1.1B to 40B, demonstrate that Centrifuge reduces backpropagation time by up to 49.9\% and end-to-end training time by up to 34.7\% when filtering 50\% of tokens. Utility assessments indicate that Centrifuge preserves the utility benefits of token filtering and improves model performance by up to 26.6\% compared to standard training. Centrifuge is designed for seamless integration into existing LLM training frameworks, enabling systems that already use token filtering to accelerate training with just one line of code.
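To make the system-level idea concrete, the following is a minimal sketch of turning a sparse GEMM into a dimension-reduced dense GEMM. It is not Centrifuge's implementation. It assumes the simplest case, the weight gradient of a linear layer, where the output-gradient rows of filtered tokens are zero. The function names (`matmul_t`, `weight_grad_sparse`, `weight_grad_reduced`) and the boolean `keep_mask` are hypothetical, and pure Python lists stand in for GPU tensors.

```python
import random


def matmul_t(a, b):
    """Return a^T @ b, where a is (n, p) and b is (n, q), as a (p, q) list."""
    n, p, q = len(a), len(a[0]), len(b[0])
    out = [[0.0] * q for _ in range(p)]
    for t in range(n):              # loop over tokens (the reduction dimension)
        for i in range(p):
            a_ti = a[t][i]
            for j in range(q):
                out[i][j] += a_ti * b[t][j]
    return out


def weight_grad_sparse(x, grad_y, keep_mask):
    """Baseline: zero the gradient rows of filtered tokens, then run the
    full-size GEMM. The zeros are still multiplied, so no work is saved."""
    masked = [row if keep else [0.0] * len(row)
              for row, keep in zip(grad_y, keep_mask)]
    return matmul_t(x, masked)


def weight_grad_reduced(x, grad_y, keep_mask):
    """Dimension-reduced variant: gather the kept tokens first, then run a
    smaller dense GEMM over those rows only."""
    kept = [t for t, keep in enumerate(keep_mask) if keep]
    if not kept:                    # every token filtered: gradient is zero
        return [[0.0] * len(grad_y[0]) for _ in range(len(x[0]))]
    x_kept = [x[t] for t in kept]
    gy_kept = [grad_y[t] for t in kept]
    return matmul_t(x_kept, gy_kept)


if __name__ == "__main__":
    random.seed(0)
    n_tokens, d_in, d_out = 8, 3, 2
    x = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(n_tokens)]
    grad_y = [[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(n_tokens)]
    keep_mask = [t % 2 == 0 for t in range(n_tokens)]   # filter 50% of tokens

    full = weight_grad_sparse(x, grad_y, keep_mask)
    small = weight_grad_reduced(x, grad_y, keep_mask)
    # Both paths should agree up to floating-point tolerance.
    assert all(abs(f - s) < 1e-12
               for fr, sr in zip(full, small) for f, s in zip(fr, sr))
```

In this sketch the reduced path iterates over half as many tokens when 50\% are filtered, yet it remains an ordinary dense matrix product. That is the property that would let a standard GEMM routine handle it without dedicated sparse kernels.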