Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, their large parameter counts cause significant latency during inference. This is particularly evident with autoregressive decoding, which generates only one token per forward pass and therefore does not fully exploit the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely \textit{hidden transfer}, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the \textit{pseudo} hidden states of the future tokens to be generated. The pseudo hidden states then pass through the remaining transformer layers, thereby assimilating more semantic information and achieving higher predictive accuracy for the future tokens. In addition, we use a novel tree attention mechanism to simultaneously generate and verify multiple candidate output sequences, which ensures lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method, and extensive analytical experiments support our motivation. In terms of acceleration metrics, our method outperforms all single-model acceleration techniques, including Medusa and Self-Speculative decoding.
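The verification idea behind lossless generation can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes greedy decoding, represents the candidate tree as a flat list of token sequences, and calls a hypothetical stand-in model (`toy_greedy_next`) sequentially, whereas a tree attention mechanism would score all candidates in one forward pass. All function names here are illustrative assumptions.

```python
from typing import Callable, List, Sequence

# Hypothetical type: maps a token prefix to the base model's greedy next token.
GreedyNext = Callable[[List[int]], int]


def accepted_prefix_len(context: Sequence[int],
                        candidate: Sequence[int],
                        greedy_next: GreedyNext) -> int:
    """Count how many leading candidate tokens match greedy decoding."""
    prefix = list(context)
    count = 0
    for token in candidate:
        if greedy_next(prefix) != token:
            break  # first mismatch: the rest of this candidate is discarded
        prefix.append(token)
        count += 1
    return count


def verify_candidates(context: Sequence[int],
                      candidates: Sequence[Sequence[int]],
                      greedy_next: GreedyNext) -> List[int]:
    """Keep the longest verified candidate prefix, then add one model token.

    The result always equals what plain greedy decoding would produce,
    which is the sense in which the procedure is lossless.
    """
    best: List[int] = []
    for candidate in candidates:
        n = accepted_prefix_len(context, candidate, greedy_next)
        if n > len(best):
            best = list(candidate[:n])
    # The base model's own prediction after the accepted prefix is always
    # valid, so every step emits at least one token.
    bonus = greedy_next(list(context) + best)
    return best + [bonus]


def toy_greedy_next(prefix: List[int]) -> int:
    """Stand-in for a real model (assumption): next token is last + 1 mod 10."""
    return (prefix[-1] + 1) % 10


context = [3, 4]
candidates = [[5, 6, 9], [5, 7, 8], [2, 3, 4]]
accepted = verify_candidates(context, candidates, toy_greedy_next)
print(accepted)
```

In this toy setting the first candidate contributes two verified tokens and the base model supplies a third, so one verification step emits three tokens instead of one. The better the drafted future tokens, the longer the accepted prefix, which is why more accurate pseudo hidden states translate into higher speedup.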