Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so that each worker contributes the same aggregate learning rate over a cycle. In the non-convex setting, under smoothness and bounded heterogeneity assumptions, we prove that the resulting method, Rescaled ASGD, converges to stationary points of the correct global objective in the fixed-computation model. Its time complexity matches the known lower bound in the leading term, while the effects of staleness and data heterogeneity appear only in lower-order terms. Experiments confirm that the method converges to the correct objective and is competitive with state-of-the-art baselines.
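To make the bias and its correction concrete, the following is a minimal sketch, not the authors' implementation. It assumes two workers with fixed computation times, noiseless gradients of hypothetical local quadratics f_i(x) = (x - b_i)^2 / 2, and a normalization of the rescaled stepsizes by the mean computation time. The abstract states only that stepsizes are proportional to computation times, so that normalization and every name in the code are illustrative choices. With computation times 1 and 4 and targets 0 and 10, the frequency-weighted average of the local minimizers is 2, while the minimizer of the global objective is 5.

```python
import heapq


def run_asgd(taus, targets, base_lr, horizon, rescale):
    """Simulate ASGD on local quadratics f_i(x) = 0.5 * (x - targets[i])**2.

    taus[i] is the fixed time worker i needs to compute one gradient.
    If rescale is True, worker i's stepsize is proportional to taus[i]
    (normalized by the mean computation time, an assumption of this sketch).
    """
    n = len(taus)
    mean_tau = sum(taus) / n
    lrs = [base_lr * (tau / mean_tau if rescale else 1.0) for tau in taus]

    x = 0.0
    # Each event is (arrival_time, worker_id, model_snapshot_used_for_gradient).
    events = [(tau, i, x) for i, tau in enumerate(taus)]
    heapq.heapify(events)

    while events[0][0] <= horizon:
        t, i, x_stale = heapq.heappop(events)
        # The server applies the (possibly stale) gradient as soon as it arrives.
        x -= lrs[i] * (x_stale - targets[i])
        # The worker immediately starts a new gradient at the current model.
        heapq.heappush(events, (t + taus[i], i, x))
    return x


taus = [1.0, 4.0]        # worker 0 is four times faster than worker 1
targets = [0.0, 10.0]    # local minimizers; the global minimizer is their mean, 5.0

x_vanilla = run_asgd(taus, targets, base_lr=0.01, horizon=5000.0, rescale=False)
x_rescaled = run_asgd(taus, targets, base_lr=0.01, horizon=5000.0, rescale=True)

print(f"vanilla ASGD:  {x_vanilla:.3f}")
print(f"rescaled ASGD: {x_rescaled:.3f}")
```

In this toy setting the vanilla run settles near the frequency-weighted point and the rescaled run settles near the global minimizer, each up to a small oscillation that scales with the constant stepsize. The sketch omits gradient noise and says nothing about the convergence rates claimed in the abstract.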