LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Large language model (LLM) serving infrastructures are undergoing a shift toward heterogeneity and disaggregation. Modern deployments increasingly integrate diverse accelerators and near-memory processing technologies, introducing significant hardware heterogeneity, while system software increasingly separates computation, memory, and model components across distributed resources to improve scalability and efficiency. As a result, LLM serving performance is no longer determined by hardware or software choices in isolation, but by their runtime interaction through scheduling, data movement, and interconnect behavior. However, understanding these interactions remains challenging, as existing simulators lack the ability to jointly model heterogeneous hardware and disaggregated serving techniques within a unified, runtime-driven framework. This paper presents LLMServingSim 2.0, a unified system-level simulator designed to make runtime-driven hardware-software interactions in heterogeneous and disaggregated LLM serving infrastructures explicit and analyzable. LLMServingSim 2.0 embeds serving decisions and hardware behavior into a single runtime loop, enabling interaction-aware modeling of batching, routing, offloading, memory, and power. The simulator supports extensible integration of emerging accelerators and memory systems through profile-based modeling, while capturing dynamic serving behavior and system-level effects. We validate LLMServingSim 2.0 against real deployments, showing that it reproduces key performance, memory, and power metrics with an average error of 0.97%, while maintaining simulation times of around 10 minutes even for complex configurations. These results demonstrate that LLMServingSim 2.0 provides a practical bridge between hardware innovation and serving-system design, enabling systematic exploration and co-design for next-generation LLM serving infrastructures.
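To make the idea of a single runtime loop with profile-based modeling more concrete, the following is a minimal toy sketch. It is not LLMServingSim 2.0's actual code or API. The names `Request`, `simulate`, and `PROFILE_MS_PER_TOKEN`, the latency values, and the simplified batching policy are all assumptions invented for illustration. Each loop iteration makes a serving decision (which requests to admit into the batch) and then advances simulated time by a hardware cost looked up from a per-device profile, so the two interact step by step rather than being modeled separately.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical latency profiles in milliseconds, keyed by (device, phase).
# The numbers are made up for illustration and are not measurements.
PROFILE_MS_PER_TOKEN = {
    ("gpu", "prefill"): 0.05,  # cost per prompt token
    ("gpu", "decode"): 2.0,    # cost per batched decode step
    ("pim", "decode"): 1.5,    # an assumed near-memory device for decode
}


@dataclass
class Request:
    req_id: int
    arrival_ms: float
    prompt_tokens: int
    output_tokens: int
    generated: int = 0
    prefilled: bool = False


def simulate(requests, decode_device="gpu", max_batch=4):
    """Toy runtime loop: admit requests, then charge the profiled step cost.

    Returns a dict mapping req_id to simulated completion time in ms.
    """
    pending = deque(sorted(requests, key=lambda r: r.arrival_ms))
    running = []
    finished = {}
    now = 0.0

    while pending or running:
        # Serving decision: admit requests that have arrived, up to max_batch.
        while (pending and len(running) < max_batch
               and pending[0].arrival_ms <= now):
            running.append(pending.popleft())

        if not running:
            # Nothing to run yet: jump simulated time to the next arrival.
            now = pending[0].arrival_ms
            continue

        # Hardware behavior: look up the cost of this step in the profile.
        step_ms = 0.0
        any_decode = False
        for r in running:
            if not r.prefilled:
                step_ms += r.prompt_tokens * PROFILE_MS_PER_TOKEN[("gpu", "prefill")]
                r.prefilled = True
            else:
                r.generated += 1
                any_decode = True
        if any_decode:
            # One batched decode step is charged once, regardless of batch size.
            step_ms += PROFILE_MS_PER_TOKEN[(decode_device, "decode")]
        now += step_ms

        # Retire requests that have produced all of their output tokens.
        still_running = []
        for r in running:
            if r.prefilled and r.generated >= r.output_tokens:
                finished[r.req_id] = now
            else:
                still_running.append(r)
        running = still_running

    return finished
```

For example, `simulate([Request(0, 0.0, 100, 2)])` charges one prefill step and two decode steps on the assumed `gpu` profile, and passing `decode_device="pim"` swaps in a different profile entry without touching the loop. That substitution is the sense in which a profile-based design can accommodate new hardware in this sketch.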
