Shuffling gradient methods, also known as stochastic gradient descent (SGD) without replacement, are widely implemented in practice, most notably in three popular algorithms: Random Reshuffle (RR), Shuffle Once (SO), and Incremental Gradient (IG). In contrast to their empirical success, the theoretical guarantees of shuffling gradient methods remained poorly understood for a long time. Only recently have convergence rates been established for the average iterate on convex functions and for the last iterate on strongly convex problems (using the squared distance as the metric). However, when the function value gap is used as the convergence criterion, existing theories cannot explain the good performance of the last iterate in different settings (e.g., constrained optimization). To bridge this gap between practice and theory, we prove last-iterate convergence rates for shuffling gradient methods with respect to the objective value, even without strong convexity. Our new results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.
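To make the three sampling schemes concrete, the following is a minimal Python sketch of epoch-based shuffling gradient descent; it is illustrative only (the names `shuffling_gd` and `grads` are hypothetical, not from the paper), assuming access to the per-component gradients of a finite-sum objective f(x) = (1/n) Σ_i f_i(x):

```python
import numpy as np

def shuffling_gd(grads, x0, lr, epochs, scheme="RR", seed=0):
    """Hypothetical sketch of SGD without replacement.

    grads:  list of n callables, grads[i](x) = gradient of f_i at x
    scheme: "RR" (fresh permutation each epoch),
            "SO" (one permutation drawn once, reused every epoch),
            "IG" (fixed deterministic order 0, 1, ..., n-1)
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    fixed_perm = rng.permutation(n)      # drawn once, used by SO
    x = x0
    for _ in range(epochs):
        if scheme == "RR":
            order = rng.permutation(n)   # reshuffle every epoch
        elif scheme == "SO":
            order = fixed_perm           # same permutation in all epochs
        else:                            # "IG"
            order = np.arange(n)
        for i in order:                  # visit each component exactly once
            x = x - lr * grads[i](x)
    return x
```

The three variants differ only in how the visiting order is chosen per epoch; within an epoch, every component gradient is used exactly once, which is what distinguishes these methods from with-replacement SGD. The last iterate analyzed in the paper is the value of x returned after the final epoch.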