The increasing size of deep learning models has created the need for more efficient alternatives to the standard error backpropagation algorithm, ones that make better use of asynchronous, parallel, and distributed computing. One major shortcoming of backpropagation is the interlocking between the forward phase of the algorithm, which computes a global loss, and the backward phase, in which the loss is backpropagated through all layers to compute the gradients used to update the network parameters. To address this problem, we propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads. Furthermore, since we observe that the forward pass is often much faster than the backward pass, we use separate threads for the forward and backward computations, which allows us to use a higher ratio of forward to backward threads than the usual 1:1, reducing the overall staleness of the parameters. Thus, our approach performs asynchronous stochastic gradient descent using separate threads for the loss (forward) and gradient (backward) computations, and applies layer-wise partial updates to the parameters in a distributed way. We show that this approach yields close to state-of-the-art results while running up to 2.97x faster than Hogwild! scaled to multiple devices (Locally-Partitioned-Asynchronous-Parallel SGD). We prove the convergence of the algorithm within a novel theoretical framework based on stochastic differential equations and the drift-diffusion process, by modeling the asynchronous parameter updates as a stochastic process.