As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly important. Batching LLM requests is a critical step in scheduling inference jobs on servers (e.g., GPUs): it lets the system maximize throughput by processing multiple requests in parallel. However, requests often have widely varying generation lengths, which causes resource underutilization, since the hardware must wait for the longest-running request in a batch to complete before moving on to the next batch. We formalize this problem from a queueing-theoretic perspective and aim to design a throughput-optimal control policy. We propose Multi-Bin Batching, a simple yet effective method that provably improves LLM inference throughput by grouping requests with similar (predicted) execution times into predetermined bins. Through a combination of theoretical analysis and experiments, including real-world LLM inference scenarios, we demonstrate significant throughput gains over standard batching approaches.
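The core idea can be sketched in a few lines: under a simplified model where a batch occupies the hardware until its longest request finishes, routing requests into bins by (predicted) execution time and batching within each bin reduces the total busy time relative to first-come-first-served batching. This is a minimal illustrative sketch, not the paper's implementation; the bin edges, batch size, and cost model below are assumptions chosen for demonstration.

```python
import random


def batch_completion_time(batch):
    # Simplified cost model: a batch occupies the hardware until
    # its longest request completes.
    return max(batch)


def standard_batching(times, batch_size):
    # FCFS baseline: fill batches in arrival order, mixing short
    # and long requests, and sum the per-batch completion times.
    total = 0.0
    for i in range(0, len(times), batch_size):
        total += batch_completion_time(times[i:i + batch_size])
    return total


def multi_bin_batching(times, batch_size, bin_edges):
    # Route each request to a bin by its (predicted) execution time,
    # then batch within each bin, so every batch holds requests of
    # similar length and little time is wasted on stragglers.
    bins = [[] for _ in range(len(bin_edges) + 1)]
    for t in times:
        idx = sum(t > edge for edge in bin_edges)
        bins[idx].append(t)
    total = 0.0
    for b in bins:
        for i in range(0, len(b), batch_size):
            total += batch_completion_time(b[i:i + batch_size])
    return total


# Synthetic workload: execution times drawn uniformly from [1, 10).
random.seed(0)
times = [random.uniform(1, 10) for _ in range(1000)]

std = standard_batching(times, batch_size=8)
mbb = multi_bin_batching(times, batch_size=8, bin_edges=[4.0, 7.0])
print(f"standard: {std:.1f}, multi-bin: {mbb:.1f}")
```

In this toy setting the binned schedule finishes the same workload in substantially less total busy time, because each batch's completion time is capped by its bin's upper edge rather than by the global maximum; real deployments would rely on a length predictor to assign requests to bins.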