Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods is able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode. In this work we propose a new distributed learning method, DIANA, which resolves this issue via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are superior to existing rates. We also provide theory to support non-smooth regularizers and study the difference between quantization schemes. Our analysis of block-quantization and of the differences between $\ell_2$ and $\ell_{\infty}$ quantization closes the gaps between theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.
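To make "compression of gradient differences" concrete, the following is a minimal single-process sketch, not the authors' reference implementation. The abstract does not state the update rule, so everything here is an assumption: each simulated worker keeps a memory vector `h[i]`, quantizes the difference between its gradient and that memory with an unbiased ternary $\ell_2$ scheme, and both sides add a fraction `alpha` of the quantized difference to the memory. The toy objective $f_i(x) = \frac{1}{2}\|x - b_i\|^2$, the function names, and the values of `gamma` and `alpha` are all hypothetical choices for illustration.

```python
import math
import random


def quantize_l2(delta, rng):
    """Unbiased ternary quantization (assumed scheme): coordinate j becomes
    sign(delta_j) * ||delta||_2 with probability |delta_j| / ||delta||_2,
    and 0 otherwise."""
    norm = math.sqrt(sum(v * v for v in delta))
    if norm == 0.0:
        return [0.0] * len(delta)
    return [math.copysign(norm, v) if rng.random() < abs(v) / norm else 0.0
            for v in delta]


def diana(targets, steps=500, gamma=0.2, alpha=0.25, seed=0):
    """Simulate n workers minimizing the average of 0.5 * ||x - b_i||^2.

    targets[i] plays the role of b_i; gamma is the step size and alpha the
    memory learning rate (both illustrative values, tuned for this toy)."""
    rng = random.Random(seed)
    n, d = len(targets), len(targets[0])
    x = [0.0] * d
    h = [[0.0] * d for _ in range(n)]  # per-worker gradient memory
    for _ in range(steps):
        q = []
        for i in range(n):
            grad = [x[j] - targets[i][j] for j in range(d)]  # local gradient
            diff = [grad[j] - h[i][j] for j in range(d)]     # gradient difference
            q.append(quantize_l2(diff, rng))                 # what would be sent
        for j in range(d):
            h_mean = sum(h[i][j] for i in range(n)) / n
            q_mean = sum(q[i][j] for i in range(n)) / n
            x[j] -= gamma * (h_mean + q_mean)  # step along the gradient estimate
        for i in range(n):
            for j in range(d):
                h[i][j] += alpha * q[i][j]     # memory moves toward the gradient
    return x


targets = [[1.0, -2.0, 0.5, 3.0], [0.0, 4.0, -1.5, 1.0], [2.0, 1.0, 1.0, -4.0]]
print(diana(targets))
```

In this toy problem the minimizer is the coordinate-wise mean of the `targets` rows. The point of the sketch is that the quantized quantity `diff` shrinks as each memory `h[i]` approaches the local gradient at the optimum, so the quantization noise shrinks with it; quantizing `grad` directly would leave noise proportional to the nonzero local gradients at the optimum.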