In recent years, large language models (LLMs) have driven substantial intelligent transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of GPUs, with multiple devices collocated per node. As network scale expands, inter-node communication becomes a primary bottleneck to training efficiency. Network-state simulators therefore play a crucial role by enabling cost-effective evaluation of network configurations and parallelization strategies through faithful emulation of DCN dynamics during LLM training. However, existing simulators are constrained by an efficiency-fidelity tradeoff: packet-level simulators (PLSs) incur prohibitive runtime overhead, whereas flow-level simulators (FLSs) compromise essential modeling accuracy. In this paper, we develop \texttt{HyGra}, a hybrid-granularity network-state simulator that exploits the intrinsic network dynamics of LLM training to adaptively switch simulation granularity. Specifically, \texttt{HyGra} employs packet-level simulation during non-steady phases with transient fluctuations and flow-level simulation during steady phases with periodic patterns, thereby accelerating execution while preserving high fidelity. Moreover, it requires no specialized hardware, supports single-machine deployment, and is compatible with existing simulators. Experiments based on representative commercial LLM workloads, including ChatGPT, DeepSeek, and Qwen, show that \texttt{HyGra} achieves up to a 15.4$\times$ speedup under a single parallelization strategy and 7.8$\times$ under hybrid parallelization strategies while maintaining high accuracy.