Large Language Model (LLM) inference consists of two distinct phases: a prefill phase, which processes the input prompt, and a decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization because it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to pipeline bubbles. We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch from a single prefill chunk and fills the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less than they would in a decode-only batch. Chunked-prefills allows multiple decode-maximal batches to be constructed from a single prefill request, maximizing the number of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on an A6000 GPU, SARATHI improves decode throughput by up to 10x and accelerates end-to-end throughput by up to 1.33x. For LLaMA-33B on an A100 GPU, we achieve 1.25x higher end-to-end throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.
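The batching scheme described in the abstract, chunked-prefills combined with decode-maximal batching, can be illustrated with a minimal sketch. This is not SARATHI's implementation: the names `Batch`, `split_prefill`, `decode_maximal_batches`, `chunk_size`, and `batch_slots` are hypothetical, the round-robin choice of which decodes piggyback on each chunk is an assumption, and so is letting the last chunk be shorter when the prompt length is not a multiple of the chunk size.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Batch:
    prefill_chunk: Tuple[int, int]  # (start, end) token range of the prompt
    decode_ids: List[int]           # decode requests piggybacking on the chunk


def split_prefill(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt into consecutive chunks of chunk_size tokens.

    Assumption: the final chunk is simply shorter if prompt_len is not a
    multiple of chunk_size.
    """
    if prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("prompt_len and chunk_size must be positive")
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]


def decode_maximal_batches(prompt_len: int, chunk_size: int,
                           decode_ids: List[int],
                           batch_slots: int) -> List[Batch]:
    """Build one batch per prefill chunk and fill the other slots with decodes.

    Each batch holds exactly one prefill chunk; the remaining
    batch_slots - 1 slots go to pending decode requests, picked round-robin
    (an assumption) so that successive chunks cover different decodes.
    """
    if batch_slots < 1:
        raise ValueError("batch_slots must be at least 1")
    n = len(decode_ids)
    take = min(batch_slots - 1, n)  # decodes that fit alongside one chunk
    batches = []
    for i, chunk in enumerate(split_prefill(prompt_len, chunk_size)):
        picked = [decode_ids[(i * take + j) % n] for j in range(take)]
        batches.append(Batch(prefill_chunk=chunk, decode_ids=picked))
    return batches
```

For example, `decode_maximal_batches(1024, 256, list(range(6)), 4)` yields four batches, one per 256-token chunk, each carrying three decode requests. A single long prefill therefore gives several opportunities for decodes to piggyback, which is the coverage benefit the abstract attributes to chunking.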