Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and their key performance indicator is throughput. These tasks often exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and are limited in their support for large batched tasks with prefix sharing. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests. Under this implicit cache management, KV context that is about to be reused may be evicted prematurely. Moreover, streaming-oriented systems do not leverage request-batch information and cannot optimally mix decoding tokens with prefill chunks in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address these problems. BatchLLM explicitly identifies common prefixes globally. Requests sharing the same prefix are scheduled together to maximize reuse of the KV context. BatchLLM also reorders requests, scheduling those with a larger ratio of decoding first to better mix decoding tokens with later prefill chunks, and applies memory-centric token batching to enlarge token-batch sizes, which helps increase GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments. Code is available at https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064.
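To make the scheduling idea concrete, the following is a minimal sketch of grouping requests by shared prefix and ordering the groups by decoding ratio. It is an illustration under simplifying assumptions, not BatchLLM's implementation: the `Request` class, the `est_decode_len` field, the fixed-length grouping key `key_len`, and the helpers `common_prefix_len` and `schedule` are all hypothetical names, and grouping on the first few tokens is a crude stand-in for global prefix identification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    prompt: list[int]     # prompt token ids
    est_decode_len: int   # assumed estimate of the number of output tokens


def common_prefix_len(seqs):
    """Length of the longest prefix shared by all token sequences."""
    n = 0
    for column in zip(*seqs):  # stops at the shortest sequence
        if len(set(column)) != 1:
            break
        n += 1
    return n


def schedule(requests, key_len=4):
    """Group requests by their first `key_len` tokens (a simplification),
    then order groups so that those with a larger decoding ratio run first."""
    groups = defaultdict(list)
    for r in requests:
        groups[tuple(r.prompt[:key_len])].append(r)

    def decode_ratio(group):
        shared = common_prefix_len([r.prompt for r in group])
        # With prefix reuse, the shared prefix is prefilled once per group.
        prefill = shared + sum(len(r.prompt) - shared for r in group)
        decode = sum(r.est_decode_len for r in group)
        total = prefill + decode
        return decode / total if total else 0.0

    return sorted(groups.values(), key=decode_ratio, reverse=True)


a1 = Request([1, 2, 3, 4, 10, 11], est_decode_len=2)
a2 = Request([1, 2, 3, 4, 12], est_decode_len=2)
b1 = Request([5, 6, 7, 8, 9], est_decode_len=20)
order = schedule([a1, a2, b1])  # the decode-heavy group [b1] is placed first
```

Here the two requests sharing the prefix `[1, 2, 3, 4]` land in one group so that their common KV context would be computed once, while the decode-heavy request is scheduled ahead of them so that its decoding tokens can be mixed with the later prefill chunks.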