StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training

When a distributed training job slows down, the hard part is knowing where to look. Synchronization hides the cause: a stall on one rank shows up as a wait on the others, so a data delay on a single rank can surface as backward time across the group. The cheap dashboards that run all the time -- per-stage averages and maxima -- misread this, double-counting the same exposed delay or burying the slow rank in an average, while full profilers see it clearly but are far too heavy to leave on. StageFrontier is an always-on signal that closes this gap. Each rank reports only a short ordered vector of coarse stage durations -- data, forward, backward, and so on -- timed with CPU wall-clock, with no synchronized clocks and no kernel tracing. At each stage boundary, StageFrontier takes the cumulative time of whichever rank is furthest along; the increments of this frontier form an exact, additive accounting of the step's exposed time and point to the stage and rank where group-visible delay first appears, telling an operator where to aim a heavy profiler, not which fix to make. The accounting is exact, but the coarse signal alone cannot tell whether a leading stage truly caused the slowdown or merely ran alongside it; StageFrontier labels the windows where that distinction needs more evidence instead of guessing. A PyTorch implementation adds under 0.2% throughput overhead through 128 ranks on Gloo and NCCL, places injected faults among its top two suspects on all 50 rows of a hidden-rank DDP test, and recovers the same top-stage routing as PyTorch Profiler, HTA, and Nsight Systems once their traces are reduced to the same coarse stages -- from a 0.11 MB summary instead of a 15.81 GB trace.
