Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded into the GPU cache through its limited memory bandwidth for computation, yet only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer rather than computation. Recently, parallel decoding, a type of speculative decoding algorithm, has grown in popularity and has demonstrated impressive efficiency improvements in generation. It adds extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the next-token-prediction objective used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of the speculators and thus boosts overall efficiency. Clover transmits sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next-token prediction. Experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and exceeds the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and 57% on Baichuan-Large.
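To make the notion of a candidate "hit" concrete, the following is a minimal sketch of greedy verification in speculative decoding. It is not Clover's or Medusa's implementation: the function name `accept_speculated`, the exact-match acceptance rule, and the single linear candidate chain are assumptions made for illustration only.

```python
from typing import List, Sequence


def accept_speculated(candidates: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one decoding step under greedy verification.

    candidates[i] is the token proposed by a speculator for the i-th future position.
    verified[i] is the base model's own greedy token for that position, computed in
    the same forward pass with candidates[:i] appended to the prefix.
    verified must contain len(candidates) + 1 entries (hypothetical convention).
    """
    if len(verified) != len(candidates) + 1:
        raise ValueError("verified must have exactly one more entry than candidates")

    accepted: List[int] = []
    for cand, target in zip(candidates, verified):
        if cand != target:
            break  # first miss: every later candidate is discarded
        accepted.append(cand)

    # The base model's token after the last accepted candidate is always valid,
    # so each step emits at least one token even when every candidate misses.
    accepted.append(verified[len(accepted)])
    return accepted


if __name__ == "__main__":
    # Two of three candidates hit, so the step emits three tokens instead of one.
    print(accept_speculated([5, 7, 9], [5, 7, 2, 4]))
```

Under this acceptance rule, each additional leading candidate that matches yields one more token per forward pass, which is why a higher hit rate translates directly into fewer decoding steps.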