This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while producing exactly the same output as the original decoding. Unlike conventional strategies, PPD uses additional compute resources to begin decoding subsequent tokens in parallel while the current token is still being decoded. This reduces decoding latency and offers a new view of the trade-offs in LLM decoding strategies. We develop a theoretical framework for analyzing the trade-off between computation and latency. Using this framework, we analytically estimate the potential latency reduction of the proposed method from the match rate, denoted p_correct. The results show that extra computational resources have the potential to accelerate greedy decoding in LLMs.
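To make the role of p_correct concrete, the following is a minimal sketch of a toy latency model. It is not the theoretical framework of the paper, whose details are not given in this abstract. The assumptions are ours: decoding one token takes one unit of time, a correctly predicted next token gets a head start of `overlap` units (a hypothetical parameter), and a mispredicted token falls back to ordinary decoding at full cost, so the output is unchanged. The function names are illustrative as well.

```python
import random


def expected_latency(n_tokens, p_correct, overlap):
    """Closed-form expected latency under the toy model (an assumption).

    The first token always costs 1 unit. Every later token costs
    1 - overlap when the early prediction matched (probability p_correct)
    and 1 unit otherwise.
    """
    if n_tokens <= 0:
        return 0.0
    return 1.0 + (n_tokens - 1) * (1.0 - p_correct * overlap)


def simulate_latency(n_tokens, p_correct, overlap, seed=0):
    """Monte Carlo version of the same toy model."""
    rng = random.Random(seed)
    total = 0.0
    for i in range(n_tokens):
        # A match lets this token's decoding overlap with the previous one.
        matched = i > 0 and rng.random() < p_correct
        total += 1.0 - overlap if matched else 1.0
    return total


if __name__ == "__main__":
    n, overlap = 1000, 0.5
    for p in (0.0, 0.5, 0.9):
        print(p, expected_latency(n, p, overlap), simulate_latency(n, p, overlap))
```

In this sketch the latency saving grows linearly with p_correct, while the extra compute spent on speculative starts is not modeled at all; a fuller analysis would account for both sides of the trade-off.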