HyGra: Accelerating Network-State Simulation for LLM Training in DCNs via Adaptive Packet-Flow Granularity

In recent years, large language models (LLMs) have driven substantial transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of GPUs, with multiple devices collocated per node. As network scale expands, inter-node communication becomes a primary bottleneck to training efficiency. Network-state simulators therefore play a crucial role: by faithfully emulating DCN dynamics during LLM training, they enable cost-effective evaluation of network configurations and parallelization strategies. However, existing simulators are constrained by an efficiency-fidelity tradeoff: packet-level simulators (PLSs) incur prohibitive runtime overhead, whereas flow-level simulators (FLSs) compromise essential modeling accuracy. In this paper, we develop \texttt{HyGra}, a hybrid-granularity network-state simulator that exploits intrinsic network dynamics in LLM training to adaptively switch simulation granularity. Specifically, \texttt{HyGra} employs packet-level simulation during non-steady phases with transient fluctuations and flow-level simulation during steady phases with periodic patterns, thereby accelerating execution while preserving high fidelity. Moreover, it requires no specialized hardware, supports single-machine deployment, and is compatible with existing simulators. Experiments on representative commercial LLM workloads, including ChatGPT, DeepSeek, and Qwen, show that \texttt{HyGra} achieves up to 15.4$\times$ speedup under a single parallelization strategy and 7.8$\times$ under hybrid parallelization strategies while maintaining high accuracy.
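
The switching idea described in the abstract can be illustrated with a minimal sketch. This is not \texttt{HyGra}'s actual algorithm, which the abstract does not specify. It assumes a hypothetical detector that treats a phase as steady when the coefficient of variation of recent throughput samples falls below a threshold; the class name `GranularitySelector`, the window size, and the threshold are all illustrative assumptions. Near-constant throughput also stands in for the periodic steady-phase patterns mentioned above, which is a simplification.

```python
from collections import deque
from statistics import mean, pstdev


class GranularitySelector:
    """Hypothetical steady-phase detector (an assumption, not HyGra's method)."""

    def __init__(self, window=4, cv_threshold=0.05):
        # Keep only the most recent `window` throughput samples.
        self.samples = deque(maxlen=window)
        self.cv_threshold = cv_threshold

    def observe(self, throughput):
        """Record one throughput sample and return the chosen granularity."""
        self.samples.append(throughput)
        return self.mode()

    def mode(self):
        # Without a full window there is too little evidence of steadiness,
        # so fall back to the higher-fidelity packet-level mode.
        if len(self.samples) < self.samples.maxlen:
            return "packet"
        avg = mean(self.samples)
        if avg == 0:
            return "packet"
        # Coefficient of variation: relative spread of the recent samples.
        cv = pstdev(self.samples) / avg
        return "flow" if cv < self.cv_threshold else "packet"


if __name__ == "__main__":
    selector = GranularitySelector(window=4, cv_threshold=0.05)
    # Made-up trace: a transient burst, a stable plateau, then a disruption.
    trace = [1.0, 6.0, 3.0, 9.0, 10.0, 10.0, 10.0, 10.0, 2.0]
    for sample in trace:
        print(sample, selector.observe(sample))
```

In this sketch the selector stays in packet-level mode through the fluctuating samples, moves to flow-level mode once the window is stable, and drops back to packet-level mode when a sample breaks the pattern. A real simulator would also have to transfer state (queue occupancy, in-flight data) between the two models at each switch, which this example does not attempt.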
