The increasing size of deep learning models has created the need for more efficient alternatives to the standard error backpropagation algorithm, ones that make better use of asynchronous, parallel, and distributed computing. One major shortcoming of backpropagation is the interlocking between the forward phase of the algorithm, which computes a global loss, and the backward phase, in which the loss is backpropagated through all layers to compute the gradients used to update the network parameters. To address this problem, we propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads. Furthermore, since we observe that the forward pass is often much faster than the backward pass, we use separate threads for the forward and backward computations, which allows us to use a higher ratio of forward to backward threads than the usual 1:1, reducing the overall staleness of the parameters. Thus, our approach performs asynchronous stochastic gradient descent using separate threads for the loss (forward) and gradient (backward) computations, and applies layer-wise partial updates to the parameters in a distributed way. We show that this approach yields close to state-of-the-art results while running up to 2.97x faster than Hogwild! scaled to multiple devices (Locally-Partitioned-Asynchronous-Parallel SGD). We prove the convergence of the algorithm within a novel theoretical framework based on stochastic differential equations and the drift-diffusion process, by modeling the asynchronous parameter updates as a stochastic process.