Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency, such as time-to-first-token (TTFT), and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, traditional load-balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects per-request snapshots of the cluster, including real-time instance state, request characteristics, and observed performance, and uses them to train an online reward predictor. It then routes each inference request to the instance predicted to maximize a given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks such as vLLM. In experiments on a public cloud GPU cluster, Lodestar continuously adapts online to changing workloads and infrastructure conditions and achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix-cache- and load-aware heuristic. It learns these efficient routing strategies within about 5 minutes.
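The routing loop described above (observe instance state and request features, predict the reward per instance, route to the best prediction, then learn from the observed TTFT) can be illustrated with a minimal sketch. This is not Lodestar's implementation: the abstract does not specify the model, the features, or any exploration policy, so the linear SGD predictor, the two features (`queued_tokens`, `prefix_hit_ratio`), the `epsilon` exploration knob, and every name below are assumptions chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical per-instance snapshot; real systems would track far more."""
    name: str
    queued_tokens: int       # tokens waiting in this instance's queue
    prefix_hit_ratio: float  # fraction of the prompt already in its KV cache


class OnlineTTFTPredictor:
    """Assumed stand-in for a reward predictor: a linear model updated by SGD."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, observed_ttft):
        # One squared-error gradient step per completed request.
        err = self.predict(x) - observed_ttft
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]


def features(inst, prompt_tokens):
    """Assumed feature vector: bias, queue depth, and uncached prompt tokens."""
    uncached = prompt_tokens * (1.0 - inst.prefix_hit_ratio)
    return [1.0, inst.queued_tokens / 1000.0, uncached / 1000.0]


def route(predictor, instances, prompt_tokens, epsilon=0.0, rng=random):
    """Pick the instance with the lowest predicted TTFT.

    With probability epsilon, pick a random instance instead so the
    predictor keeps seeing outcomes for instances it currently disfavors.
    """
    if rng.random() < epsilon:
        return rng.choice(instances)
    return min(instances,
               key=lambda i: predictor.predict(features(i, prompt_tokens)))
```

In use, a router would call `route(...)` for each incoming request and, once the first token is returned, call `predictor.update(features(chosen, prompt_tokens), observed_ttft)` with the feature vector captured at routing time. Treating the reward as negative TTFT makes "lowest predicted TTFT" equivalent to "highest predicted reward".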