In modern machine learning, parallelizing training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, slow workers still degrade the convergence of ASGD because their updates arrive with large delays. At the same time, it has been observed empirically in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior: we show that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions and is motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
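To make the setting concrete, the following is a minimal single-process sketch of ASGD with a constant step size and norm-based gradient clipping. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the function names `clip` and `clipped_asgd`, the explicit `delays` list, the Gaussian noise (used here only as a simple stand-in for the sub-Weibull model), and all parameter values are hypothetical choices made for this example.

```python
import math
import random


def clip(g, threshold):
    """Rescale the vector g so that its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= threshold or norm == 0.0:
        return list(g)
    scale = threshold / norm
    return [scale * x for x in g]


def clipped_asgd(grad, x0, step, threshold, delays, noise=0.0, seed=0):
    """Simulate ASGD with stale gradients (illustrative assumption, not the paper's method).

    Update t applies a gradient evaluated at the iterate from delays[t]
    steps earlier, mimicking a slow worker. Each gradient is clipped
    before the constant-step update.
    """
    rng = random.Random(seed)
    history = [list(x0)]  # history[t] is the iterate before update t
    for t, d in enumerate(delays):
        stale = history[max(0, t - d)]  # iterate the slow worker read
        g = [gi + rng.gauss(0.0, noise) for gi in grad(stale)]  # noisy oracle
        g = clip(g, threshold)
        x = history[-1]
        history.append([xi - step * gi for xi, gi in zip(x, g)])
    return history[-1]


def quadratic_grad(x):
    """Gradient of f(x) = 0.5 * ||x||^2, a toy objective for the demo."""
    return list(x)


if __name__ == "__main__":
    # Every tenth update comes from a worker that is 50 steps behind.
    delays = [50 if t % 10 == 0 else 0 for t in range(2000)]
    x_final = clipped_asgd(quadratic_grad, [10.0, -10.0], 0.1, 1.0, delays)
    print(math.sqrt(sum(v * v for v in x_final)))
```

Because each clipped update moves the iterate by at most `step * threshold`, a single very stale gradient cannot push the iterate far, which is the intuition behind the "stabilizing" effect described above.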