Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples, achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
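To make the idea of ordering examples by their gradients concrete, the following is a minimal sketch of a single balance-then-reorder pass in the spirit of centralized GraB. It is an illustration under simplifying assumptions, not the authors' implementation: the per-example gradients are given as fixed vectors, the exact mean stands in for the stale mean from the prior epoch, and the names `balance_and_reorder` and `max_prefix_norm` are hypothetical. The coordinated, distributed variant is not sketched here.

```python
import math


def balance_and_reorder(order, vectors):
    """Return a new ordering from one signed-balancing pass (illustrative only).

    order   -- current permutation of example indices
    vectors -- vectors[i] is the gradient of example i, as a list of floats
    """
    dim = len(vectors[0])
    n = len(order)
    # Assumption: the exact mean replaces the stale mean from the prior epoch.
    mean = [sum(vectors[i][d] for i in order) / n for d in range(dim)]
    running = [0.0] * dim  # signed running sum of centered gradients
    front, back = [], []
    for i in order:
        c = [vectors[i][d] - mean[d] for d in range(dim)]
        plus = sum((r + x) ** 2 for r, x in zip(running, c))
        minus = sum((r - x) ** 2 for r, x in zip(running, c))
        if plus <= minus:
            # Sign +1: keep the running sum small and place the example early.
            running = [r + x for r, x in zip(running, c)]
            front.append(i)
        else:
            # Sign -1: subtract instead and place the example late.
            running = [r - x for r, x in zip(running, c)]
            back.append(i)
    # Positive examples in visit order, negative examples in reverse visit order.
    return front + back[::-1]


def max_prefix_norm(order, vectors):
    """Largest norm of any prefix sum of centered gradients under `order`."""
    dim = len(vectors[0])
    n = len(order)
    mean = [sum(vectors[i][d] for i in order) / n for d in range(dim)]
    running = [0.0] * dim
    worst = 0.0
    for i in order:
        running = [r + vectors[i][d] - mean[d] for d, r in enumerate(running)]
        worst = max(worst, math.sqrt(sum(r * r for r in running)))
    return worst


# Toy one-dimensional gradients; two identical examples are adjacent at first.
toy_vectors = [[1.0], [1.0], [-1.0], [-1.0]]
old_order = [0, 1, 2, 3]
new_order = balance_and_reorder(old_order, toy_vectors)
print(new_order, max_prefix_norm(old_order, toy_vectors),
      max_prefix_norm(new_order, toy_vectors))
```

On this toy input the reordered sequence interleaves opposing gradients, so the largest prefix sum of centered gradients shrinks. A single pass is not guaranteed to reduce this quantity on arbitrary inputs; the sketch only illustrates the mechanism.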