Packet-level discrete-event simulation (PLDES) is a prevalent tool for evaluating the detailed performance of large-model training. Although PLDES offers high fidelity and generality, its slow performance has long plagued networking practitioners. Existing optimization techniques either simplify the network model, introducing large errors, or parallelize execution across multiple processors, which has an inherent upper bound on speedup. This paper explores an alternative optimization direction that reduces the computational load of PLDES while maintaining high fidelity. Our key insight is that, in distributed LLM training, packet-level traffic often exhibits repetitive contention patterns and steady states in which flow rates stabilize; skipping these redundant discrete events speeds up the simulation considerably with negligible error. We realize this idea in Wormhole, a user-transparent PLDES kernel that automatically memoizes unsteady states and skips steady states. Wormhole combines network partitioning, state memoization and reuse, and rate-based steady-state identification to accurately determine each flow's steady-state periods while preserving simulation consistency after fast-forwarding. Experiments demonstrate that Wormhole achieves up to a 744x speedup over the original ns-3 (510x for an MoE workload) with a bounded error below 1%. Combining existing multithreaded parallelization with Wormhole yields a 1012x speedup, reducing the simulation time of one GPT-13B training run on 128 GPUs from 9 hours to 5 minutes.
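The rate-based steady-state identification and fast-forwarding described above can be sketched as follows. This is a minimal illustration, not Wormhole's actual implementation: the class names, window size, and tolerance threshold are all hypothetical, assuming a steady state is declared when a flow's recent rate samples vary within a small relative tolerance, after which the clock is advanced analytically instead of simulating each packet event.

```python
# Hypothetical sketch of rate-based steady-state detection and fast-forwarding.
# All names and parameters are illustrative; Wormhole's real kernel operates
# inside the discrete-event scheduler and also memoizes unsteady states.

from collections import deque


class FlowMonitor:
    """Tracks recent rate samples for one flow and flags a steady state
    when the relative variation stays within `tol` over a full window."""

    def __init__(self, window: int = 8, tol: float = 0.01):
        self.samples = deque(maxlen=window)
        self.tol = tol

    def observe(self, rate_bps: float) -> None:
        self.samples.append(rate_bps)

    def is_steady(self) -> bool:
        # Require a full window of samples before judging stability.
        if len(self.samples) < self.samples.maxlen:
            return False
        lo, hi = min(self.samples), max(self.samples)
        return hi > 0 and (hi - lo) / hi <= self.tol


def fast_forward(remaining_bytes: int, steady_rate_bps: float) -> float:
    """Skip per-packet events: return the time (seconds) the flow needs
    to finish at its stable rate, so the clock can jump ahead."""
    return remaining_bytes * 8 / steady_rate_bps
```

In this sketch, a flow whose sampled rate holds steady for a full window would be fast-forwarded: for example, 125 MB remaining at a stable 1 Gbps advances the simulated clock by 1 second in a single step, rather than processing every packet's enqueue, dequeue, and ACK events individually.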