Asynchronous stochastic gradient methods are central to scalable distributed optimization, particularly when devices differ in computational capabilities. Such settings arise naturally in federated learning, where training takes place on smartphones and other heterogeneous edge devices. In addition to varying computation speeds, these devices often hold data drawn from different distributions. However, existing asynchronous SGD methods struggle in such heterogeneous settings and face two key limitations. First, many rely on unrealistic assumptions of similarity across workers' data distributions. Second, methods that relax this assumption still fail to achieve theoretically optimal performance under heterogeneous computation times. We introduce Ringleader ASGD, the first asynchronous SGD algorithm that attains the theoretical lower bounds for parallel first-order stochastic methods in the smooth nonconvex regime, thereby achieving optimal time complexity under data heterogeneity without restrictive similarity assumptions. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary, and even time-varying, worker computation speeds, closing a fundamental gap in the theory of asynchronous optimization.
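To make the setting concrete, below is a minimal sketch of *plain* asynchronous SGD under heterogeneous worker speeds and data distributions; it is not the Ringleader ASGD update, which is not specified here. All names (`worker`, `local_grad`, the per-worker delays) are illustrative assumptions; the toy quadratic objective simply gives each worker its own data distribution.

```python
# Minimal sketch of generic asynchronous SGD with heterogeneous workers.
# Each worker reads a (possibly stale) snapshot of the shared parameters,
# computes a stochastic gradient on its own data, and applies it to the
# server without waiting for other workers.

import threading
import time
import numpy as np

dim = 10
lr = 0.01
num_workers = 4
total_updates = 200

x = np.zeros(dim)            # shared model parameters on the server
lock = threading.Lock()      # serializes server-side reads/updates
updates_done = 0

def local_grad(params, rng):
    """Stochastic gradient of a toy quadratic; each worker's rng acts as
    its own data distribution (data heterogeneity)."""
    target = rng.normal(size=params.shape)        # worker-specific optimum
    return (params - target) + 0.1 * rng.normal(size=params.shape)

def worker(worker_id, delay):
    global x, updates_done
    rng = np.random.default_rng(worker_id)
    while True:
        with lock:
            if updates_done >= total_updates:
                return
            snapshot = x.copy()        # read a possibly stale copy
        time.sleep(delay)              # heterogeneous computation time
        g = local_grad(snapshot, rng)  # gradient at the stale point
        with lock:                     # apply without global synchronization
            if updates_done >= total_updates:
                return
            x -= lr * g
            updates_done += 1

# Worker speeds differ by factors of 2: the fastest worker contributes
# far more updates than the slowest one.
threads = [threading.Thread(target=worker, args=(i, 0.001 * 2**i))
           for i in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final parameters:", x)
```

The sketch exhibits both failure modes the abstract points to: gradients are computed at stale parameter snapshots, and faster workers dominate the update stream, so when workers hold dissimilar data the iterates are biased toward the fast workers' distributions.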