Efficient Distributed Auto-Differentiation

Although distributed machine learning has opened up numerous frontiers of research, splitting large models across devices, nodes, and sites can incur significant communication overhead, making reliable training difficult. The focus on gradients as the primary shared statistic during training has led to a number of intuitive algorithms for distributed deep learning; however, gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy and often require additional modifications, such as sparsity constraints, compression, and quantization, to reduce bandwidth. We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient. The error backpropagation process can be modified to share these smaller intermediate values instead of the gradient, reducing communication overhead with no impact on accuracy. The process also allows gradients to be averaged during backpropagation, enabling novel, flexible training schemas while leaving room for further bandwidth reduction via existing gradient compression methods. Finally, consideration of the matrices used to compute the gradient inspires a new approach to compression via structured power iterations, which can not only reduce bandwidth but also enable introspection into distributed training dynamics, without significant performance loss.
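
The abstract does not spell out which intermediate values are shared, so the following is only a minimal sketch under an explicit assumption: for a single fully connected layer, the weight gradient is the product of the transposed input activations and the backpropagated output errors (deltas). If each site sends those two factors instead of its full gradient, the aggregator can stack them and recover exactly the sum of the per-site gradients. All names and sizes here (`matmul_t`, `sites`, `batch`, `d_in`, `d_out`) are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import random


def matmul_t(a, d):
    """Return a^T d for row-major lists: a is (n, d_in), d is (n, d_out)."""
    n, d_in, d_out = len(a), len(a[0]), len(d[0])
    return [[sum(a[k][i] * d[k][j] for k in range(n)) for j in range(d_out)]
            for i in range(d_in)]


def rand_matrix(rng, rows, cols):
    """Random (rows, cols) matrix standing in for activations or deltas."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Hypothetical sizes: a small local batch and a 64 -> 32 dense layer.
batch, d_in, d_out, n_sites = 4, 64, 32, 2

# Each site holds its own input activations A_s and output deltas D_s.
sites = [(rand_matrix(rng, batch, d_in), rand_matrix(rng, batch, d_out))
         for _ in range(n_sites)]

# Baseline: every site forms its full (d_in, d_out) gradient and sends it;
# the aggregator sums the per-site gradients.
local_grads = [matmul_t(a, d) for a, d in sites]
grad_sum = [[sum(g[i][j] for g in local_grads) for j in range(d_out)]
            for i in range(d_in)]

# Assumed alternative: every site sends only its two factors; the aggregator
# stacks them along the batch axis and multiplies once.
stacked_a = [row for a, _ in sites for row in a]
stacked_d = [row for _, d in sites for row in d]
grad_from_factors = matmul_t(stacked_a, stacked_d)

# The two routes agree up to floating-point rounding.
max_diff = max(abs(grad_sum[i][j] - grad_from_factors[i][j])
               for i in range(d_in) for j in range(d_out))

# Values sent per site for this layer under each scheme.
floats_gradient = d_in * d_out            # full gradient matrix
floats_factors = batch * (d_in + d_out)   # activations plus deltas
print(floats_gradient, floats_factors, max_diff < 1e-9)
```

Dividing the stacked product by the total number of samples gives the averaged gradient rather than the sum. In this sketch the factors are smaller than the gradient only while the local batch size stays below `d_in * d_out / (d_in + d_out)`, so the saving depends on layer width and batch size.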
