Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness: updating the model immediately with delayed gradients introduces noise into the optimization process. We identify a critical yet often overlooked pathology: this delay scales linearly with pipeline depth, undermining the very scalability that the method is intended to provide. In this work, we investigate this inconsistency and address it by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the harmful effects of delayed gradients are exacerbated when the Hessian eigenbasis is misaligned with the standard coordinate basis. We show that this misalignment prevents coordinate-wise adaptive schemes, such as Adam, from effectively exploiting curvature-aware adaptivity, which leads to significant oscillations in the optimization trajectory and, consequently, slower convergence. We substantiate these findings through both theoretical analysis and empirical evaluation. To address this challenge, we propose basis rotation and demonstrate that it mitigates the misalignment and significantly accelerates convergence in asynchronous settings. For example, training a 1B-parameter LLM with basis rotation reaches the same training loss in 76.8% fewer iterations than the best-performing asynchronous pipeline-parallel baseline.
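To make the setting concrete, the following is a minimal toy sketch, not the method evaluated in this work. It runs Adam on a 2D quadratic whose Hessian eigenvectors sit at 45 degrees to the coordinate axes, feeds the optimizer gradients that are `delay` steps stale, and keeps the Adam moments either in the standard basis or in the exact Hessian eigenbasis. The quadratic, the matrix `H`, the helper names (`matvec`, `delayed_adam`), the hyperparameters, and the use of an exactly known eigenbasis are all illustrative assumptions; a practical scheme would have to estimate the rotation.

```python
import math

def matvec(M, v):
    # Dense matrix-vector product on plain Python lists.
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def loss(H, x):
    # Quadratic objective f(x) = 0.5 * x^T H x.
    return 0.5 * sum(xi * hi for xi, hi in zip(x, matvec(H, x)))

def delayed_adam(H, x0, Q, delay, steps, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam on f(x) = 0.5 x^T H x with gradients that are `delay` steps stale.

    The first and second moments are tracked in the basis given by the
    columns of the orthogonal matrix Q (Q = identity recovers plain Adam).
    Returns the final loss.
    """
    Qt = transpose(Q)
    history = [list(x0)]  # past iterates, used to look up stale weights
    x = list(x0)
    m = [0.0] * len(x0)
    v = [0.0] * len(x0)
    for t in range(1, steps + 1):
        # Gradient evaluated at the iterate from `delay` steps ago.
        stale = history[max(0, len(history) - 1 - delay)]
        g = matvec(Qt, matvec(H, stale))  # rotate gradient into basis Q
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        step = [lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
        dx = matvec(Q, step)  # rotate the update back to parameter space
        x = [xi - di for xi, di in zip(x, dx)]
        history.append(x)
    return loss(H, x)

# Assumed toy Hessian: eigenvalues 100 and 1, eigenvectors at 45 degrees.
H = [[50.5, 49.5], [49.5, 50.5]]
s = 1 / math.sqrt(2)
Q_eig = [[s, s], [s, -s]]        # columns are the eigenvectors of H
I = [[1.0, 0.0], [0.0, 1.0]]     # standard coordinate basis
x0 = [1.0, 0.0]

for delay in (0, 4, 8):
    plain = delayed_adam(H, x0, I, delay, steps=300)
    rotated = delayed_adam(H, x0, Q_eig, delay, steps=300)
    print(f"delay={delay}: standard-basis loss={plain:.3e}, eigenbasis loss={rotated:.3e}")
```

In the eigenbasis the Hessian is diagonal, so each coordinate that Adam normalizes corresponds to a single curvature direction. In the standard basis, both coordinates mix the stiff and flat directions. Varying `delay` and comparing the printed losses gives a rough feel for how staleness and misalignment interact in this toy problem.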
