LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Large language model (LLM) serving infrastructures are undergoing a shift toward heterogeneity and disaggregation. Modern deployments integrate diverse accelerators and near-memory processing technologies, introducing significant hardware heterogeneity, while system software increasingly separates computation, memory, and model components across distributed resources to improve scalability and efficiency. As a result, LLM serving performance is no longer determined by hardware or software choices in isolation, but by their runtime interaction through scheduling, data movement, and interconnect behavior. Understanding these interactions remains challenging, however, because existing simulators cannot jointly model heterogeneous hardware and disaggregated serving techniques within a unified, runtime-driven framework. This paper presents LLMServingSim 2.0, a unified system-level simulator designed to make runtime-driven hardware-software interactions in heterogeneous and disaggregated LLM serving infrastructures explicit and analyzable. LLMServingSim 2.0 embeds serving decisions and hardware behavior into a single runtime loop, enabling interaction-aware modeling of batching, routing, offloading, memory, and power. The simulator supports extensible integration of emerging accelerators and memory systems through profile-based modeling, while capturing dynamic serving behavior and system-level effects. We validate LLMServingSim 2.0 against real deployments, showing that it reproduces key performance, memory, and power metrics with an average error of 0.95%, while keeping simulation times to around 10 minutes even for complex configurations. These results demonstrate that LLMServingSim 2.0 provides a practical bridge between hardware innovation and serving-system design, enabling systematic exploration and co-design for next-generation LLM serving infrastructures.
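
To make the idea of a single runtime loop with profile-based modeling more concrete, the following minimal Python sketch shows one way a scheduling decision (batch admission) and a hardware cost lookup (a per-device latency profile) might share one simulated clock. It is an illustration under stated assumptions, not the LLMServingSim 2.0 implementation: the names `PROFILE`, `Request`, and `simulate`, the two device labels, and every latency value are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-device latency profiles in milliseconds. The numbers are
# invented for illustration and are not measurements from the paper.
PROFILE = {
    "gpu": {"prefill_ms_per_token": 0.05, "decode_ms_per_step": 20.0},
    "pim": {"prefill_ms_per_token": 0.20, "decode_ms_per_step": 12.0},
}


@dataclass
class Request:
    arrival_ms: float
    prompt_tokens: int
    output_tokens: int
    generated: int = 0
    prefilled: bool = False


def simulate(requests, device, max_batch=4):
    """Return the simulated time (ms) at which the last request finishes."""
    profile = PROFILE[device]
    pending = sorted(requests, key=lambda r: r.arrival_ms)
    running = []
    clock = 0.0
    while pending or running:
        # Serving decision: admit arrived requests while the batch has room.
        while (pending and len(running) < max_batch
               and pending[0].arrival_ms <= clock):
            running.append(pending.pop(0))
        if not running:
            # Nothing to run yet: jump the clock to the next arrival.
            clock = pending[0].arrival_ms
            continue
        # Hardware behavior: look up the cost of this iteration in the profile.
        step_ms = profile["decode_ms_per_step"]
        for r in running:
            if not r.prefilled:
                step_ms += r.prompt_tokens * profile["prefill_ms_per_token"]
                r.prefilled = True
        clock += step_ms
        # Each running request produces one token per iteration.
        for r in running:
            r.generated += 1
        running = [r for r in running if r.generated < r.output_tokens]
    return clock


if __name__ == "__main__":
    reqs = [Request(0.0, 100, 2), Request(0.0, 100, 2)]
    print(simulate(reqs, "gpu"))
```

Because the admission logic and the profile lookup advance the same clock, changing either one (a larger `max_batch`, or a different device entry in `PROFILE`) changes the timing that the other observes. That coupling is the kind of hardware-software interaction the abstract describes, here reduced to a toy scale.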
