We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains at inference time. This is achieved by applying top-K sparsification to the activations and the straight-through estimator to the training. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training from scratch, continue-training of off-the-shelf LLMs, and fine-tuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (optionally combined with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
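The abstract names two mechanisms: top-K sparsification of activations in the forward pass and the straight-through estimator (STE) during training. Below is a minimal PyTorch sketch of how these two pieces can be combined; the class and function names (`TopKSparsify`, `top_k_sparsify`) are illustrative, and the sketch omits details of the paper's exact formulation (e.g., any rescaling of the surviving activations).

```python
import torch


class TopKSparsify(torch.autograd.Function):
    """Top-K activation sparsification with a straight-through estimator (STE).

    Forward: keep only the K largest-magnitude entries of each activation row,
    zeroing the rest. Backward: pass the gradient through unchanged (STE), so
    the non-differentiable masking step does not block training.
    """

    @staticmethod
    def forward(ctx, x, k):
        # Select the K entries with the largest absolute value per row.
        _, idx = torch.topk(x.abs(), k, dim=-1)
        mask = torch.zeros_like(x)
        mask.scatter_(-1, idx, 1.0)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: gradient flows as if the forward pass
        # were the identity; k receives no gradient.
        return grad_output, None


def top_k_sparsify(x, k):
    return TopKSparsify.apply(x, k)


# Toy usage: keep 2 of 8 entries per activation row.
if __name__ == "__main__":
    x = torch.randn(2, 8, requires_grad=True)
    y = top_k_sparsify(x, k=2)
    y.sum().backward()
    print(y)       # only 2 non-zero entries per row
    print(x.grad)  # all ones: gradient passed straight through
```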