Production vLLM fleets provision every instance for the worst-case context length. This wastes 4-8x concurrency on the 80-95% of requests that are short, and it also triggers KV-cache failures: OOM crashes, preemption storms, and request rejections. Both problems share a single root cause, a mismatch between configuration and traffic. We propose token-budget-aware pool routing: estimate each request's total token budget using a self-calibrating, per-category bytes-per-token ratio, then dispatch the request to one of two vLLM pools, a high-throughput short pool or a high-capacity long pool, each right-sized for its workload class. The ratio is learned online as an exponential moving average over usage.prompt_tokens feedback and requires no tokenizer. A closed-form cost model, savings = alpha * (1 - 1/rho), predicts fleet-level GPU savings from two observable quantities: the short-traffic fraction alpha and the throughput gain ratio rho. On traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, token-budget routing reduces GPU instances by 17-39% (\$1.2-2.0M/yr at 1,000 req/s), with the savings verified by a self-contained discrete-event simulator. A case study projecting Qwen3-235B-A22B on AMD MI300X at 10,000 req/s shows \$15.4M/yr in savings. The algorithm adds O(1) dispatch overhead, self-calibrates across content types without a tokenizer, and composes with PagedAttention, continuous batching, and prefill-decode disaggregation.
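The routing and cost-model ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name `TokenBudgetRouter`, the short-pool limit of 8192 tokens, the EMA weight of 0.1, the initial ratio of 4.0 bytes per token, and the choice to define the token budget as estimated prompt tokens plus the requested maximum output tokens are all assumptions made for illustration. Only the savings formula and the use of `usage.prompt_tokens` as feedback come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Illustrative two-pool router; all defaults are hypothetical."""
    short_limit: int = 8192       # assumed max token budget for the short pool
    ema_weight: float = 0.1       # assumed EMA smoothing factor
    default_ratio: float = 4.0    # assumed initial bytes-per-token guess
    ratios: dict = field(default_factory=dict)  # category -> bytes per token

    def estimate_budget(self, category: str, prompt_bytes: int,
                        max_output_tokens: int) -> float:
        # Estimate prompt tokens from byte length, with no tokenizer involved,
        # then add the output allowance (an assumed definition of "budget").
        ratio = self.ratios.get(category, self.default_ratio)
        return prompt_bytes / ratio + max_output_tokens

    def route(self, category: str, prompt_bytes: int,
              max_output_tokens: int) -> str:
        # O(1) dispatch: one dictionary lookup and one comparison.
        budget = self.estimate_budget(category, prompt_bytes, max_output_tokens)
        return "short" if budget <= self.short_limit else "long"

    def observe(self, category: str, prompt_bytes: int,
                prompt_tokens: int) -> None:
        # Feedback step: fold the observed ratio, computed from the
        # usage.prompt_tokens value of a completed response, into the EMA.
        if prompt_tokens <= 0:
            return
        observed = prompt_bytes / prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        w = self.ema_weight
        self.ratios[category] = (1.0 - w) * previous + w * observed


def fleet_savings(alpha: float, rho: float) -> float:
    """Closed-form model from the abstract: savings = alpha * (1 - 1/rho)."""
    if not 0.0 <= alpha <= 1.0 or rho <= 0.0:
        raise ValueError("alpha must be in [0, 1] and rho must be positive")
    return alpha * (1.0 - 1.0 / rho)
```

For example, with illustrative inputs that are not taken from the paper's measurements, `fleet_savings(0.9, 1.5)` evaluates to roughly 0.30: if 90% of traffic is short and the short pool sustains 1.5x the throughput per instance, the model predicts about a 30% reduction in GPU instances. Keeping one ratio per category lets content with different byte densities, such as code versus non-Latin text, calibrate independently.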