This paper presents "Predictive Pipelined Decoding (PPD)," an approach that accelerates greedy decoding in Large Language Models (LLMs) while producing exactly the same output as standard decoding. Unlike conventional strategies, PPD uses additional compute resources to begin decoding subsequent tokens in parallel while the current token is still being decoded. This method reduces decoding latency and reshapes the understanding of trade-offs in LLM decoding strategies. We develop a theoretical framework for analyzing the trade-off between computation and latency. Within this framework, the potential latency reduction of the proposed method can be estimated analytically from the match rate, denoted p_correct. The results demonstrate that extra computational resources have the potential to accelerate LLM decoding. Additionally, we implement PPD and conduct preliminary experiments to empirically validate its efficacy, addressing practical overheads not covered by the theoretical analysis.
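The compute–latency trade-off described above can be illustrated with a toy model. In the sketch below, an early prediction of the next token becomes available a fraction `f` of the way through each decoding step and matches the final greedy token with probability `p_correct`; on a match, the next step's launch overlaps with the remainder of the current step. This is a minimal illustrative model under these assumptions, not the paper's exact derivation: the names `expected_latency_per_token`, `simulate`, and the parameter `f` are hypothetical.

```python
import random

def expected_latency_per_token(T, p_correct, f):
    """Toy analytic model (illustrative assumption, not the paper's
    exact formula): with probability p_correct the early guess is
    correct and (1 - f) * T of the next step's latency is hidden."""
    return T * (1.0 - p_correct * (1.0 - f))

def simulate(T, p_correct, f, n_tokens=100_000, seed=0):
    """Monte Carlo check of the analytic model above."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tokens):
        if rng.random() < p_correct:
            total += T - (1.0 - f) * T  # overlapped launch hides (1-f)*T
        else:
            total += T                  # misprediction: full step latency
    return total / n_tokens
```

For example, with `p_correct = 0.8` and a guess available at the halfway point (`f = 0.5`), expected per-token latency in this toy model drops to 0.6 of the baseline, matching the qualitative claim that higher match rates translate into lower latency.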