Modern LLM GPU fleets are provisioned for worst-case context lengths that the vast majority of requests never approach, wasting GPU capacity on idle KV-cache slots. We present FleetOpt, a framework that starts from first principles: given a workload's prompt-length CDF and a P99 TTFT target, derive the minimum-cost fleet analytically, then deploy it in practice. The analytical core models each pool as an M/G/c queue and shows that the minimum-cost fleet is a two-pool architecture -- a short-context pool and a long-context pool -- with an optimal boundary B* at which the marginal GPU cost is equalized across the two pools. The fundamental barrier to achieving B* is the cost cliff: a hard routing step where requests just above B* consume 8x--42x more GPU capacity than requests just below it (depending on the context window ratio), creating a structural disincentive to lower the boundary. Compress-and-Route (C&R) is the implementation mechanism that resolves this barrier. Gateway-layer extractive compression trims borderline requests below B* before the engine ever sees them, converting the hard hardware boundary into a software parameter read from the workload CDF. The two components are unified in the FleetOpt offline planner: given a CDF and SLO, it returns the optimal (n_s*, n_l*, B*, gamma*) in under 1 ms. On three production traces, the combined framework reduces total GPU cost by 6--82% versus a homogeneous fleet, with C&R contributing 1--44 percentage points beyond plain pool routing, depending on workload archetype. The analytical model is validated against a discrete-event simulator (inference-fleet-sim) with <= 3% error on predicted GPU utilization across all pools and workloads.
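As a toy illustration of the boundary search the planner performs, the sketch below picks the boundary B that minimizes total KV-cache demand over an empirical prompt-length sample. All names here (`plan_two_pool`, `slots_per_gpu`) are hypothetical, and the sizing is deliberately simplified: each request is charged its pool's full context window, and the M/G/c queueing step that FleetOpt uses to size pools against the P99 TTFT target is omitted.

```python
import bisect

def plan_two_pool(prompt_lengths, boundary_candidates, long_window,
                  slots_per_gpu=1_000_000):
    """Toy boundary search: pick the B that minimizes total KV-cache demand.

    Simplified sketch, not the FleetOpt planner: each request reserves
    KV-cache proportional to its pool's context window (B tokens in the
    short pool, long_window tokens in the long pool), and GPU count is
    taken as total reserved tokens / slots_per_gpu. The real planner
    additionally sizes each pool with an M/G/c model against the TTFT SLO.
    """
    lengths = sorted(prompt_lengths)
    n = len(lengths)
    best = None
    for b in boundary_candidates:
        # Requests with prompt length <= b route to the short pool.
        n_short = bisect.bisect_right(lengths, b)
        n_long = n - n_short
        # Short pool reserves b tokens per request; long pool reserves
        # the full long-context window per request.
        cost = (n_short * b + n_long * long_window) / slots_per_gpu
        if best is None or cost < best[1]:
            best = (b, cost)
    return best  # (B*, estimated GPU count)
```

Even this crude search exhibits the cost cliff from the abstract: raising B drags the per-slot reservation of every short-pool request up with it, while lowering B pushes borderline requests onto long-pool slots that cost long_window/B times more, which is exactly the tension C&R relaxes by compressing borderline requests below B.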