Gradient Balancing (GraB) is a recently proposed technique that finds provably better data permutations when training models for multiple epochs over a finite dataset. It converges at a faster rate than the widely adopted Random Reshuffling by minimizing the discrepancy of the gradients on adjacently selected examples. However, GraB operates only under critical assumptions, such as small batch sizes and centralized data, leaving open the question of how to order examples at large scale, i.e., in distributed learning with decentralized data. To address this limitation, in this paper we propose D-GraB, which involves two novel designs: (1) $\textsf{PairBalance}$, which eliminates GraB's need for a stale gradient mean, a requirement that critically relies on small learning rates; and (2) an ordering protocol that runs $\textsf{PairBalance}$ in a distributed environment with negligible overhead and benefits from both data ordering and parallelism. We prove that D-GraB enjoys a linear speedup, converging at rate $\tilde{O}((mnT)^{-2/3})$ on smooth non-convex objectives and $\tilde{O}((mnT)^{-2})$ under the PL condition, where $n$ denotes the number of parallel workers, $m$ denotes the number of examples per worker, and $T$ denotes the number of epochs. Empirically, we show on various applications, including GLUE, CIFAR10, and WikiText-2, that D-GraB outperforms naive parallel GraB and Distributed Random Reshuffling in terms of both training and validation performance.
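The abstract does not spell out how $\textsf{PairBalance}$ works, so the following is only an illustrative sketch of one plausible reading, not the authors' algorithm. It assumes that consecutive examples are paired, that the difference of each pair's gradients is given a sign that keeps a running signed sum small, and that the sign decides which member of the pair moves toward the front of the next epoch's order and which toward the back. Because only differences within a pair are used, no gradient mean (stale or otherwise) is needed for centering. The function name `pair_balance_order`, its arguments, and the handling of an odd leftover example are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _norm_sq(v: Vector) -> float:
    """Squared Euclidean norm of a vector."""
    return sum(x * x for x in v)


def pair_balance_order(order: List[int], grads: List[Vector]) -> List[int]:
    """Hypothetical sketch: derive next epoch's order from this epoch's gradients.

    order[k] is the example visited at step k, and grads[k] is the gradient
    observed for that example. Both conventions are assumptions of this sketch.
    """
    dim = len(grads[0])
    running = [0.0] * dim          # running signed sum of pair differences
    front: List[int] = []          # examples placed at the start of the new order
    back: List[int] = []           # examples placed (reversed) at the end

    for k in range(0, len(order) - 1, 2):
        # Difference of the two gradients in the pair; any shared mean cancels.
        diff = [a - b for a, b in zip(grads[k], grads[k + 1])]
        plus = [r + d for r, d in zip(running, diff)]
        minus = [r - d for r, d in zip(running, diff)]

        # Greedy sign choice: keep whichever signed sum has the smaller norm.
        if _norm_sq(plus) <= _norm_sq(minus):
            running = plus
            front.append(order[k])
            back.append(order[k + 1])
        else:
            running = minus
            front.append(order[k + 1])
            back.append(order[k])

    # Assumption: an unpaired last example is simply appended to the front part.
    if len(order) % 2 == 1:
        front.append(order[-1])

    return front + back[::-1]


if __name__ == "__main__":
    # Toy one-dimensional gradients for four examples.
    print(pair_balance_order([0, 1, 2, 3], [[1.0], [0.0], [1.0], [0.0]]))
```

In a distributed setting, one could imagine each worker running such a routine over its own $m$ examples, but how the ordering protocol coordinates the $n$ workers is not described in the abstract and is left out of this sketch.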