Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but they also increase end-to-end latency per inference run, so they require high speculation acceptance rates to improve performance. Combined with the variable acceptance rates observed across tasks, speculative inference can even reduce overall performance. Additionally, pipeline-parallel designs require many user requests to maintain maximum utilization. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique that reduces inter-token latency and improves system utilization for single-request scenarios while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15$\times$ improvement in generation speed over standard speculative inference. PipeInfer achieves this improvement through Continuous Asynchronous Speculation and Early Inference Cancellation: the former improves latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs, even in the middle of inference.
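To make the two named mechanisms concrete, the following is a minimal single-process sketch, not the authors' distributed implementation: it only illustrates the control flow of Continuous Asynchronous Speculation (a speculative run occupies the pipeline concurrently with the guaranteed single-token run) and Early Inference Cancellation (an invalidated run is abandoned partway through its pipeline stages). The names `draft_model`, `target_step`, `NUM_STAGES`, and the token arithmetic are hypothetical stand-ins for real models and a real multi-node pipeline.

```python
"""Sketch of Continuous Asynchronous Speculation and Early Inference Cancellation.
All model functions are toy stand-ins; a real system pipelines stages across nodes."""
import threading

NUM_STAGES = 8   # pretend pipeline depth
VOCAB = 50_000   # pretend vocabulary size

def draft_model(prefix):
    # Stand-in for a small, fast draft model proposing the next token.
    return (sum(prefix) * 7 + 1) % VOCAB

def target_step(prefix, cancel=None):
    # Stand-in for the large target model, split into pipeline stages.
    state = sum(prefix)
    for _ in range(NUM_STAGES):
        if cancel is not None and cancel.is_set():
            return None               # Early Inference Cancellation: bail out mid-inference
        state = (state * 31 + 7) % VOCAB
    return state

def generate(prompt, max_new_tokens=8):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        # Continuous Asynchronous Speculation: launch a speculative run for the
        # draft model's guess while the non-speculative run is still executing.
        guess = draft_model(tokens)
        cancel = threading.Event()
        result = {}
        spec = threading.Thread(
            target=lambda: result.update(tok=target_step(tokens + [guess], cancel)))
        spec.start()

        # Guaranteed single-token run (shares the pipeline in a real system).
        true_token = target_step(tokens)
        tokens.append(true_token)

        if guess != true_token:
            cancel.set()              # speculation invalidated: stop the run early
            spec.join()
        else:
            spec.join()               # speculation accepted: reuse its result
            if result.get("tok") is not None:
                tokens.append(result["tok"])
    return tokens

if __name__ == "__main__":
    print(generate([1, 2, 3]))
```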