In distributed machine learning, efficient training across multiple agents with different data distributions poses significant challenges. Even with a centralized coordinator, current algorithms that achieve optimal communication complexity typically require either large minibatches or compromise on gradient complexity. In this work, we address both centralized and decentralized settings, for strongly convex, convex, and nonconvex objectives. We first demonstrate that a basic primal-dual method, (Accelerated) Gradient Ascent Multiple Stochastic Gradient Descent (GA-MSGD), applied to the Lagrangian of distributed optimization inherently incorporates local updates: its inner loops, which run Stochastic Gradient Descent on the primal variable, require no inter-agent communication. Notably, for strongly convex objectives, we show that (Accelerated) GA-MSGD achieves linear convergence in communication rounds even though the Lagrangian is only linear in the dual variables. This is due to a structural property: the dual variable is confined to the span of the coupling matrix, which renders the dual problem strongly concave. When integrated with the Catalyst framework, our approach achieves nearly optimal communication complexity across various settings without requiring minibatches. Moreover, for stochastic decentralized problems, it attains communication complexities comparable to those in deterministic settings, improving over existing algorithms.
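To make the primal-dual structure concrete, the following is a minimal toy sketch of the GA-MSGD pattern on a consensus Lagrangian. All specifics here are illustrative assumptions, not the paper's exact algorithm: the local objectives are simple quadratics f_i(x) = ½‖x − c_i‖², the coupling matrix W is a ring-graph Laplacian, the inner loop uses plain (deterministic) gradient descent in place of SGD, and the step sizes are chosen for this toy problem. The sketch shows the two properties the abstract highlights: the inner primal loop needs no inter-agent communication once the dual message is received, and the dual iterate stays in the span of W.

```python
# Toy sketch of GA-MSGD on the consensus Lagrangian
# L(X, Lam) = sum_i f_i(x_i) + <Lam, W X>   (illustrative assumptions throughout)
import numpy as np

rng = np.random.default_rng(0)
n_agents, dim = 4, 3
c = rng.normal(size=(n_agents, dim))   # agent i privately holds f_i(x) = 0.5||x - c_i||^2

# Ring-graph Laplacian W as the coupling matrix; W X = 0 enforces consensus.
W = 2.0 * np.eye(n_agents)
for i in range(n_agents):
    W[i, (i + 1) % n_agents] -= 1.0
    W[i, (i - 1) % n_agents] -= 1.0

X = np.zeros((n_agents, dim))    # primal variables, one row per agent
Lam = np.zeros((n_agents, dim))  # dual variables; updates keep them in the span of W

eta_x, eta_lam, inner_steps = 0.5, 1.0 / 16.0, 30  # toy step sizes (assumed)
for _ in range(150):  # each outer iteration costs one round of communication
    # Agents exchange Lam once to form the message (W^T Lam)_i; after that,
    # the inner loop on the primal variable is purely local (no communication).
    msg = W.T @ Lam
    for _ in range(inner_steps):
        X -= eta_x * ((X - c) + msg)   # gradient of L in X, row-wise per agent
    # Outer gradient ascent on the dual: one communication round to form W X.
    Lam += eta_lam * (W @ X)

# At the saddle point, all agents agree on the average of the c_i.
consensus_gap = np.max(np.abs(X - c.mean(axis=0)))
```

Because every dual update adds a multiple of W X, the columns of Lam remain in the range of W (equivalently, 1ᵀLam = 0 for a Laplacian W); restricted to that subspace, the dual problem is strongly concave, which is what yields linear convergence in communication rounds for this toy instance.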