LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling

High-quality LLM request scheduling must achieve two key objectives: routing each request to an instance that already holds a KV cache (KV$) that can accelerate its execution, and keeping the workload balanced across instances. Achieving both is challenging because pursuing one objective may compromise the other. Current approaches use various combinators (e.g., linear combinations) to compute a scheduling score from indicators for the two objectives. These approaches are complex: they require either significant workload-specific hyperparameter tuning or the development of a model- and hardware-aware simulator, and they can still yield suboptimal performance. In this paper, we show that simply multiplying two carefully chosen indicators, one KV$-aware (the number of new prefill tokens if the request is routed to an instance) and one load-balancing-aware (the instance's current batch size), and using the product as the scheduling score achieves both objectives well without any hyperparameter tuning. The key idea is that the multiplied score accounts for both objectives in a manner similar to a linear combination, with the useful property that the original hyperparameters cancel out during comparison, so no tuning is needed to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics. Our extensive experiments show that this simple approach reduces TTFT by 92% and 52%, and TPOT by 21% and 20%, compared to vLLM-v1 and a production scheduler, respectively, on real-world workloads covering chatbots, API calls, and coding agents. We also mathematically derive the conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand.
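To make the scoring rule concrete, the following is a minimal sketch of multiplication-based routing, not the paper's implementation. The `Instance` class, its `cached_prefix_tokens` field, the `score` and `route` functions, and the `kv_scale` and `load_scale` parameters are all hypothetical names introduced for illustration. The sketch assumes that a lower score is better and that the number of new prefill tokens equals the prompt length minus the tokens already covered by the instance's KV$. The two scale factors are there only to illustrate one way in which constant weights can drop out of a comparison: they multiply every candidate's score by the same positive constant, so the ranking is unchanged.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of one serving instance, for illustration only."""
    name: str
    batch_size: int            # requests currently running on the instance
    cached_prefix_tokens: int  # prompt tokens already covered by its KV$


def score(inst: Instance, prompt_tokens: int,
          kv_scale: float = 1.0, load_scale: float = 1.0) -> float:
    """Product of the KV$-aware and load-aware indicators (lower is better).

    kv_scale and load_scale are illustrative positive weights, not
    parameters from the paper.
    """
    # Assumption: tokens not found in the instance's KV$ must be prefilled.
    new_prefill_tokens = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    return (kv_scale * new_prefill_tokens) * (load_scale * inst.batch_size)


def route(instances: list[Instance], prompt_tokens: int,
          **scales: float) -> Instance:
    """Pick the instance with the smallest product score.

    Ties, including the degenerate case where an indicator is zero, are
    broken by list order here; this sketch does not handle them specially.
    """
    return min(instances, key=lambda inst: score(inst, prompt_tokens, **scales))


instances = [
    Instance("A", batch_size=8, cached_prefix_tokens=900),  # busy, warm cache
    Instance("B", batch_size=2, cached_prefix_tokens=0),    # idle, cold cache
    Instance("C", batch_size=4, cached_prefix_tokens=500),  # in between
]

chosen = route(instances, prompt_tokens=1000)
rescaled = route(instances, prompt_tokens=1000, kv_scale=0.3, load_scale=7.0)
print(chosen.name, rescaled.name)
```

In this toy setup the scores are 800, 2000, and 2000 for instances A, B, and C, so the busy instance with the warm cache is chosen, and rescaling either indicator does not change that choice. A weighted sum of the same two indicators would not have this property, since its ranking depends on the ratio of the weights.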
