Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples, achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
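To make the idea of ordering examples by stale gradients concrete, the following is a minimal, simplified sketch of a single balance-and-reorder pass in the spirit of centralized GraB. It is not the authors' implementation and does not cover the coordinated, distributed setting of CD-GraB. The function name `balance_and_reorder`, the use of plain Python lists, and the assumption that all stale gradients are available at once (rather than streamed online during the epoch) are illustrative choices only.

```python
def balance_and_reorder(grads, order):
    """One simplified GraB-style reordering pass (illustrative sketch).

    grads: list of stale gradient vectors (lists of floats), indexed by example id
    order: the current permutation of example ids
    Returns the permutation to use in the next epoch.
    """
    n, d = len(grads), len(grads[0])
    # Center the gradients with their mean (assumed here to be computed offline).
    mean = [sum(g[k] for g in grads) / n for k in range(d)]
    running = [0.0] * d      # signed running sum of centered gradients
    front, back = [], []     # examples assigned +1 and -1, respectively
    for i in order:
        c = [grads[i][k] - mean[k] for k in range(d)]
        plus = sum((running[k] + c[k]) ** 2 for k in range(d))
        minus = sum((running[k] - c[k]) ** 2 for k in range(d))
        if plus <= minus:
            # Sign +1 keeps the running sum smaller: place the example early.
            running = [running[k] + c[k] for k in range(d)]
            front.append(i)
        else:
            # Sign -1 keeps the running sum smaller: place the example late.
            running = [running[k] - c[k] for k in range(d)]
            back.append(i)
    # Examples signed -1 are appended in reverse order.
    return front + back[::-1]
```

For the toy gradients `[[1.0], [1.0], [-1.0], [-1.0]]` visited in the order `[0, 1, 2, 3]`, this pass interleaves positive and negative gradients, so the prefix sums of the reordered sequence stay closer to zero than those of the original order.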