We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework. Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas: (i) a one-bit quantization procedure that leverages dithering, and (ii) a randomized fast Walsh-Hadamard transform that flattens the stochastic gradient before quantization. The resulting approximation of the true gradient is biased, but the scheme avoids commonly encountered algorithmic problems, such as exploding variance in the one-bit compression regime, deteriorating performance for sparse gradients, and restrictive assumptions on the distribution of the stochastic gradients. In fact, we show SGD-like convergence guarantees under mild conditions. The compression technique can be used in both directions of worker-server communication, thereby allowing distributed optimization with full communication compression.
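The two ingredients named in the abstract, flattening with a randomized fast Walsh-Hadamard transform and one-bit quantization with dithering, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fwht`, `compress`, and `decompress`, the threshold parameter `lam`, and the choice of a dither drawn uniformly from `[-lam, lam]` are all assumptions made for illustration. Under these assumptions, `lam * sign(z + dither)` has expectation `z` for every transformed coordinate with `|z| <= lam`, while larger coordinates are effectively clipped.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine blocks of size h pairwise as (a + b, a - b).
        blocks = x.reshape(-1, 2, h)
        x = np.stack(
            [blocks[:, 0] + blocks[:, 1], blocks[:, 0] - blocks[:, 1]], axis=1
        ).reshape(n)
        h *= 2
    return x / np.sqrt(n)


def compress(g, lam, rng):
    """Flatten g, then keep one bit per coordinate (illustrative sketch only).

    Assumption: dither is uniform on [-lam, lam]; `lam` is a hypothetical
    threshold that should dominate the flattened coordinates.
    """
    g = np.asarray(g, dtype=float)
    signs = rng.choice([-1.0, 1.0], size=g.size)  # random diagonal sign flips
    z = fwht(signs * g)                           # flattened gradient
    dither = rng.uniform(-lam, lam, size=g.size)
    bits = (z + dither) >= 0                      # one bit per coordinate
    return bits, signs


def decompress(bits, signs, lam):
    """Map bits back to +/- lam and undo the randomized transform."""
    q = lam * np.where(bits, 1.0, -1.0)
    return signs * fwht(q)
```

In a real distributed setting the sign pattern would typically be regenerated on both sides from a shared seed rather than transmitted; returning `signs` here only keeps the sketch self-contained. The flattening step is what makes a single threshold workable: a one-hot vector of length `n` is mapped to a vector whose entries all have magnitude `1/sqrt(n)` times the original nonzero entry, so a sparse gradient no longer concentrates its energy in a few coordinates.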