Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them. inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven fleet-planning scenarios drawn from two public workload traces (LMSYS, Azure) and one synthetic agent-heavy trace. Each one surfaces a result that simple analysis gets wrong -- the right split threshold, the cheapest GPU type, whether an apparently idle fleet is actually broken -- and shows why joint simulation of queueing, routing, and hardware is necessary to find it.
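The analytical half of this approach can be sketched with the standard Allen-Cunneen approximation, which corrects the M/M/c queueing delay by the service-time variability, exactly the heavy-tail sensitivity the paragraph above describes. This is an illustrative sketch, not inference-fleet-sim's actual API: the function names `erlang_c` and `mgc_wait`, and the example workload numbers, are assumptions for demonstration.

```python
import math

def erlang_c(c: int, a: float) -> float:
    """Erlang-C probability that an arriving job must queue.
    a = lambda / mu is the offered load in Erlangs; requires a < c."""
    s = sum(a**k / math.factorial(k) for k in range(c))
    top = (a**c / math.factorial(c)) * c / (c - a)
    return top / (s + top)

def mgc_wait(arrival_rate: float, mean_service: float, cv2: float, servers: int) -> float:
    """Approximate mean queueing delay for an M/G/c system via the
    Allen-Cunneen correction: Wq(M/G/c) ~= Wq(M/M/c) * (1 + CV^2) / 2,
    where CV^2 is the squared coefficient of variation of service time.
    Heavy-tailed token-length distributions show up here as a large CV^2."""
    mu = 1.0 / mean_service
    a = arrival_rate / mu              # offered load in Erlangs
    if a >= servers:
        return float("inf")            # unstable: demand exceeds capacity
    wq_mmc = erlang_c(servers, a) / (servers * mu - arrival_rate)
    return wq_mmc * (1.0 + cv2) / 2.0

# Hypothetical example: 8 req/s, 0.9 s mean prefill time, heavy-tailed
# service (CV^2 = 4). Sweep the GPU count to see how mean queueing delay
# falls as servers are added -- the analytical seed a DES pass would refine.
for c in range(8, 14):
    print(c, round(mgc_wait(8.0, 0.9, 4.0, c), 4))
```

Note the design point this sketch makes concrete: the `(1 + cv2) / 2` factor doubles and redoubles the predicted wait as tails grow, but it only captures the *mean* delay, which is why a P99 TTFT SLO still needs discrete-event simulation on top.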