Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier is built on a disaggregated abstraction that captures the structure and dynamics of modern serving systems in three ways: it models co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers; it incorporates key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop; and it supports stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On a 16-GPU H800 testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, validation of scheduling for agentic reasoning, and RL post-training reconfiguration. We release Frontier at https://github.com/NetX-lab/Frontier.
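To make the idea of a discrete-event simulator with role-specific workers concrete, the following is a minimal sketch of Prefill-Decode Disaggregation as an event loop. It is not Frontier's implementation or API: the `Worker` and `Request` classes, the `simulate` function, and the three cost constants are all hypothetical, and it assumes one prefill worker and one decode worker, each serving a single request at a time in FIFO order with no batching.

```python
import heapq
from collections import deque
from dataclasses import dataclass
from itertools import count

# Hypothetical cost constants in seconds; not taken from the paper.
PREFILL_S_PER_TOKEN = 0.001
DECODE_S_PER_TOKEN = 0.02
KV_TRANSFER_S = 0.005


@dataclass
class Request:
    rid: int
    arrival: float
    prompt_tokens: int
    output_tokens: int
    prefill_done: float = 0.0  # time the prefill phase finished
    finish: float = 0.0        # time the last output token was produced


class Worker:
    """A role-specific worker serving one request at a time (assumption)."""

    def __init__(self, role, cost_fn):
        self.role = role
        self.cost_fn = cost_fn
        self.queue = deque()
        self.busy = False


def simulate(requests):
    events = []   # min-heap of (time, tie-breaker, kind, request)
    seq = count() # unique tie-breaker so requests are never compared

    def push(t, kind, req):
        heapq.heappush(events, (t, next(seq), kind, req))

    prefill = Worker("prefill", lambda r: r.prompt_tokens * PREFILL_S_PER_TOKEN)
    decode = Worker("decode", lambda r: r.output_tokens * DECODE_S_PER_TOKEN)

    def try_start(worker, now):
        # Start the next queued request if the worker is idle.
        if not worker.busy and worker.queue:
            req = worker.queue.popleft()
            worker.busy = True
            push(now + worker.cost_fn(req), worker.role + "_done", req)

    for r in requests:
        push(r.arrival, "arrive", r)

    while events:
        now, _, kind, req = heapq.heappop(events)
        if kind == "arrive":
            prefill.queue.append(req)
            try_start(prefill, now)
        elif kind == "prefill_done":
            req.prefill_done = now
            prefill.busy = False
            # The KV cache must reach the decode worker before decoding starts.
            push(now + KV_TRANSFER_S, "kv_arrived", req)
            try_start(prefill, now)
        elif kind == "kv_arrived":
            decode.queue.append(req)
            try_start(decode, now)
        elif kind == "decode_done":
            req.finish = now
            decode.busy = False
            try_start(decode, now)
    return requests
```

Calling `simulate` with a list of `Request` objects fills in `prefill_done` and `finish` for each one, so queueing delay on either role shows up directly in the per-request latencies. A decision-grade simulator would replace the constant costs with fitted computation, communication, and memory models and add batching and scheduling policies.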