LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling

High-quality LLM request scheduling must achieve two key objectives: routing each request to an instance that holds KV$ able to accelerate its execution, and keeping the workload balanced across instances. Achieving both is challenging because pursuing one objective may compromise the other. Current approaches use various combinators (e.g., linear combinations) to compute a scheduling score from indicators for the two objectives. These approaches are complex: they require either significant workload-specific hyperparameter tuning or the development of a model- and hardware-aware simulator, and they can still lead to suboptimal performance. In this paper, we show that a simple multiplication of two carefully chosen indicators, one KV$-aware (the number of new prefill tokens if the request is routed to an instance) and one load-balancing-aware (the current batch size of the instance), can serve as a scheduling score that achieves both objectives well without any hyperparameter tuning. The key idea is that the multiplied score accounts for both objectives in a manner similar to a linear combination, with the useful property that the original hyperparameters cancel out when scores are compared, so no tuning is needed to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics. Our extensive experiments show that this simple approach reduces TTFT by 92% and 52%, and TPOT by 21% and 20%, compared to vLLM-v1 and a production scheduler, respectively, on real-world workloads covering chatbots, API calls, and coding agents. We also mathematically derive the conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand.
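The multiplicative score described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Instance` fields, the function names, and the rule that the lowest score wins are assumptions made for illustration, and the abstract does not say how ties or zero-valued indicators are handled. The `scaled_score` helper is also hypothetical; it only shows why positive per-indicator weights cannot change the ranking, since they multiply every instance's score by the same constant.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    # Hypothetical view of a serving instance; field names are illustrative.
    name: str
    batch_size: int            # requests currently in the running batch
    cached_prefix_tokens: int  # prompt tokens already covered by this instance's KV$


def new_prefill_tokens(prompt_tokens: int, inst: Instance) -> int:
    """KV$-aware indicator: tokens that would still need prefill on this instance."""
    return max(prompt_tokens - inst.cached_prefix_tokens, 0)


def score(prompt_tokens: int, inst: Instance) -> int:
    """Multiplicative scheduling score: new prefill tokens times current batch size."""
    return new_prefill_tokens(prompt_tokens, inst) * inst.batch_size


def scaled_score(prompt_tokens: int, inst: Instance, a: float, b: float) -> float:
    """Same score with hypothetical positive weights a and b on the two indicators.

    (a * P) * (b * B) == (a * b) * (P * B), so the weights scale every
    instance's score by the same factor and drop out of any comparison.
    """
    return (a * new_prefill_tokens(prompt_tokens, inst)) * (b * inst.batch_size)


def route(prompt_tokens: int, instances: List[Instance]) -> Instance:
    """Pick the instance with the lowest score (assumed rule; first wins on ties)."""
    return min(instances, key=lambda inst: score(prompt_tokens, inst))


if __name__ == "__main__":
    fleet = [
        Instance("A", batch_size=8, cached_prefix_tokens=900),
        Instance("B", batch_size=2, cached_prefix_tokens=0),
        Instance("C", batch_size=4, cached_prefix_tokens=600),
    ]
    for inst in fleet:
        print(inst.name, score(1000, inst))
    print("routed to:", route(1000, fleet).name)
```

In this toy fleet, instance A is the most heavily loaded but has the largest cache hit, while B is nearly idle but has no cached prefix. A linear combination of the two indicators could prefer either one depending on its weights, whereas the product yields a single ranking with no weights to tune.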
