Recent advances in LLMs create a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales toward multiple GPUs and even multiple compute nodes, various coordination patterns, such as prefill-decode disaggregation and context migration, arise in serving systems. Most inference services today expose a coarse-grained request-level API with a pre-configured coordination strategy, limiting the ability to customize and dynamically reconfigure the coordination. In this paper, we propose LLM microserving, a multi-level architecture for structuring and programming LLM inference services. We introduce simple yet effective microserving APIs that support fine-grained sub-request-level actions. A programmable router transforms user requests into sub-request calls, enabling dynamic reconfiguration of serving patterns. To support diverse execution patterns, we develop a unified KV cache interface that handles various KV compute, transfer, and reuse scenarios. Our evaluation shows that LLM microserving can be reconfigured to support multiple disaggregation orchestration strategies in a few lines of Python code while maintaining state-of-the-art performance for LLM inference tasks. Additionally, it allows us to explore new strategy variants that reduce job completion time by up to 47% compared to existing strategies.