Data-parallel SGD is the de facto algorithm for distributed optimization, especially for large-scale machine learning. Despite its merits, the communication bottleneck remains one of its persistent issues. Most compression schemes that aim to alleviate it either assume noiseless communication links or fail to achieve good performance on practical tasks. In this paper, we close this gap and introduce LASER: LineAr CompreSsion in WirEless DistRibuted Optimization. LASER capitalizes on the inherent low-rank structure of gradients and transmits them efficiently over noisy channels. While enjoying theoretical guarantees similar to those of classical SGD, LASER shows consistent gains over baselines on a variety of practical benchmarks. In particular, it outperforms state-of-the-art compression schemes on challenging computer vision and GPT language modeling tasks. On the latter, we obtain a $50$-$64\%$ improvement in perplexity over our baselines for noisy channels.
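To make the idea of sending low-rank gradients over a noisy link concrete, the following is a minimal sketch and not the authors' algorithm: the abstract does not specify LASER's encoding or how it handles transmit power. The sketch assumes a single power-iteration step to obtain rank-$r$ factors and an additive Gaussian noise channel; the function names (`low_rank_factors`, `awgn`, `transmit_low_rank`) and the `noise_std` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def low_rank_factors(grad, rank, rng):
    """One power-iteration step (an assumption, not LASER's stated method).

    Returns P (m x rank, orthonormal columns) and Q (n x rank)
    such that grad is approximated by P @ Q.T.
    """
    n = grad.shape[1]
    Q = rng.standard_normal((n, rank))   # random starting subspace
    P, _ = np.linalg.qr(grad @ Q)        # orthonormal basis for grad @ Q
    Q = grad.T @ P                       # project grad onto that basis
    return P, Q

def awgn(x, noise_std, rng):
    """Assumed channel model: additive white Gaussian noise."""
    return x + noise_std * rng.standard_normal(x.shape)

def transmit_low_rank(grad, rank, noise_std, rng):
    """Send the two factors through the noisy channel and rebuild the gradient."""
    P, Q = low_rank_factors(grad, rank, rng)
    P_hat = awgn(P, noise_std, rng)
    Q_hat = awgn(Q, noise_std, rng)
    return P_hat @ Q_hat.T

rng = np.random.default_rng(0)
m, n, r = 64, 32, 2
# Synthetic gradient that is exactly rank r (illustrative data only).
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

G_clean = transmit_low_rank(G, r, noise_std=0.0, rng=rng)
G_noisy = transmit_low_rank(G, r, noise_std=0.05, rng=rng)

symbols_full = m * n            # entries sent without compression
symbols_low_rank = r * (m + n)  # entries sent as two factors
```

In this toy setting the factors carry `r * (m + n)` values instead of `m * n`, and an exactly rank-$r$ matrix is recovered when the channel is noiseless. With noise, the reconstruction error depends on how the factors are scaled before transmission, which this sketch deliberately leaves out.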