Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness: updating the model immediately with delayed gradients introduces noise into the optimization process. We identify a critical yet often overlooked pathology: the delay scales linearly with pipeline depth, undermining the very scalability that the method is meant to provide. We trace this pathology to a specific property of the optimization landscape, namely the misalignment between the Hessian eigenbasis and the standard coordinate basis, which triggers oscillations in the update trajectories of coordinate-wise adaptive optimizers. These oscillations cause delayed updates to diverge from their non-delayed counterparts, making them unsuitable for the current iteration. We formalize this insight through theoretical analysis, including a convergence bound showing that basis misalignment amplifies the delay penalty, and substantiate it with empirical evaluation. To address the problem, we propose basis rotation, a framework that rotates the optimizer's coordinate system to align with the Hessian eigenbasis so that delayed updates remain useful. We show theoretically that basis rotation minimizes basis misalignment, thereby counteracting the conditions that amplify the delay penalty. Empirically, when training LLMs with up to 3B parameters, basis rotation reduces the required number of iterations by 81.7\% compared to the best-performing asynchronous baseline.
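To make the mechanism concrete, the following is a minimal toy sketch rather than the method evaluated in the paper. It assumes a fixed 2D quadratic loss whose Hessian `H` is known exactly, takes the rotation `Q` directly from `numpy.linalg.eigh`, and models staleness by evaluating the gradient at weights that are `delay` steps old. The names `misalignment`, `RotatedAdam`, and `train_with_delay` are hypothetical, and the off-diagonal ratio used as a misalignment measure is an illustrative choice, not the definition used in the analysis.

```python
from collections import deque

import numpy as np


def misalignment(H, B):
    """Illustrative measure (an assumption): relative off-diagonal mass of H in basis B."""
    Hb = B.T @ H @ B
    off = Hb - np.diag(np.diag(Hb))
    return np.linalg.norm(off) / np.linalg.norm(H)


class RotatedAdam:
    """Adam-style coordinate-wise update applied in the basis given by the columns of Q.

    With Q equal to the identity this reduces to ordinary Adam.
    """

    def __init__(self, Q, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.Q, self.lr, self.b1, self.b2, self.eps = Q, lr, b1, b2, eps
        self.m = np.zeros(Q.shape[1])
        self.v = np.zeros(Q.shape[1])
        self.t = 0

    def step(self, w, grad):
        g = self.Q.T @ grad  # gradient expressed in rotated coordinates
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        update = m_hat / (np.sqrt(v_hat) + self.eps)
        return w - self.lr * (self.Q @ update)  # rotate the update back


def train_with_delay(opt, H, w0, delay, steps):
    """Minimize 0.5 * w^T H w using gradients evaluated at weights `delay` steps old."""
    w = w0.copy()
    history = deque([w0.copy()], maxlen=delay + 1)
    for _ in range(steps):
        stale_w = history[0]  # oldest stored weights (w0 during warm-up)
        w = opt.step(w, H @ stale_w)
        history.append(w.copy())
    return 0.5 * w @ H @ w


# Toy Hessian whose eigenvectors lie along (1, 1) and (1, -1), not the axes.
H = np.array([[2.0, 1.0], [1.0, 2.0]])
_, Q = np.linalg.eigh(H)
w0 = np.array([1.0, -0.5])

print("misalignment, standard basis:", misalignment(H, np.eye(2)))
print("misalignment, eigenbasis:    ", misalignment(H, Q))
for name, basis in [("standard", np.eye(2)), ("rotated", Q)]:
    loss = train_with_delay(RotatedAdam(basis), H, w0, delay=4, steps=300)
    print(f"final loss with delay=4, {name} basis: {loss:.3e}")
```

In a real network the Hessian is neither fixed nor available in closed form, so the rotation would have to be estimated. The sketch only shows where the rotation enters the update and how a delay can be simulated, and it makes no claim about which basis performs better on this toy problem.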