Distributed adaptive stochastic gradient methods are widely used for large-scale nonconvex optimization, such as training deep learning models. However, their communication complexity for finding $\varepsilon$-stationary points has rarely been analyzed in the nonconvex setting. In this work, we present a novel communication-efficient distributed Adam in the parameter-server model for stochastic nonconvex optimization, dubbed {\em Efficient-Adam}. Specifically, we incorporate a two-way quantization scheme into Efficient-Adam to reduce the communication cost between the workers and the server. At the same time, we adopt a two-way error feedback strategy to reduce the biases that the two-way quantization introduces on the server and on the workers. In addition, we establish the iteration complexity of Efficient-Adam for a class of quantization operators, and further characterize the communication complexity between the server and the workers required to reach an $\varepsilon$-stationary point. Finally, we apply Efficient-Adam to a toy stochastic convex optimization problem and to training deep learning models on real-world vision and language tasks. Extensive experiments, together with the theoretical guarantee, justify the merits of Efficient-Adam.
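To make the two-way quantization and two-way error feedback idea concrete, the following is a minimal NumPy sketch of one communication round. It is not the paper's algorithm: the abstract does not specify the quantizer or the update rule, so the scaled-sign compressor `quantize`, the `ErrorFeedback` class, and the `two_way_round` helper are all hypothetical names and choices made for illustration. The sketch only shows the general pattern in which each side adds its stored residual before compressing and keeps the new compression error for the next round.

```python
import numpy as np


def quantize(v):
    """Hypothetical compressor (an assumption, not taken from the paper):
    keep the sign of each entry and scale by the mean absolute value."""
    scale = np.mean(np.abs(v))
    return scale * np.sign(v)


class ErrorFeedback:
    """Keeps the residual that quantization discarded and adds it back
    to the next vector before compressing."""

    def __init__(self, dim):
        self.residual = np.zeros(dim)

    def compress(self, v):
        corrected = v + self.residual        # re-inject the previous error
        q = quantize(corrected)              # message that is actually sent
        self.residual = corrected - q        # error carried to the next round
        return q


def two_way_round(worker_updates, worker_efs, server_ef):
    """One illustrative round: workers send compressed updates to the server,
    the server averages them and broadcasts a compressed result."""
    sent = [ef.compress(u) for ef, u in zip(worker_efs, worker_updates)]
    averaged = np.mean(sent, axis=0)
    return server_ef.compress(averaged)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_workers = 4, 3
    worker_efs = [ErrorFeedback(dim) for _ in range(n_workers)]
    server_ef = ErrorFeedback(dim)
    updates = [rng.normal(size=dim) for _ in range(n_workers)]
    broadcast = two_way_round(updates, worker_efs, server_ef)
    print(broadcast.shape)
```

Because each residual is carried forward, the sum of the messages sent by one `ErrorFeedback` instance plus its current residual always equals the sum of the vectors it was given, which is the sense in which error feedback offsets the bias of a lossy compressor over time.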