Recent advances in large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected number of tokens per batch. Second, we add a second stage of speculative decoding. Taken together, these changes reduce single-batch decoding latency by 3.16x with a 762M-parameter GPT-2-L model while perfectly preserving output quality.
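For readers unfamiliar with the baseline that this work builds on, the following is a minimal sketch of plain (single-stage, linear) greedy speculative decoding. It is not the staged, tree-structured algorithm proposed here. The names `target_model`, `draft_model`, `greedy_decode`, and `speculative_decode` are hypothetical, and both "models" are toy deterministic functions standing in for real networks. The sketch illustrates only why the output matches ordinary greedy decoding with the large model while requiring fewer verification passes.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model (toy deterministic rule)."""
    return (sum(ctx) * 31 + len(ctx)) % 7


def draft_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small model: agrees with the target
    except when the context length is a multiple of 3."""
    t = target_model(ctx)
    return t if len(ctx) % 3 else (t + 1) % 7


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Ordinary autoregressive greedy decoding: one model call per token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification passes)."""
    out = list(prompt)
    limit = len(prompt) + n
    passes = 0
    while len(out) < limit:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model verifies the proposal. A real system scores all
        #    k positions in one batched forward pass; this toy version calls
        #    the target once per position and counts the group as one pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every proposed token matched, so the same pass also yields
            # the target's next token for free.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[:limit], passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
tokens, passes = speculative_decode(target_model, draft_model, prompt, 20)
print(tokens == baseline, passes)
```

Because every emitted token is the target model's own greedy choice, the result is identical to decoding with the target alone; the draft model only determines how many tokens each verification pass yields.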