Stochastic Gradient Descent (SGD) algorithms are widely used to optimize neural networks. Random Reshuffling (RR) and Single Shuffle (SS) are popular variants that cycle through the training data using a fresh random permutation in each epoch or a single fixed permutation, respectively. However, the convergence properties of these algorithms in the non-convex case are not fully understood. Existing results suggest that, in realistic training scenarios where the number of epochs is smaller than the training set size, RR may perform worse than SGD. In this paper, we analyze a general SGD algorithm that allows for arbitrary data orderings and show improved convergence rates for non-convex functions. Specifically, our analysis reveals that SGD with random or single shuffling is always at least as fast as classical SGD with replacement, regardless of the number of iterations. Overall, our study highlights the benefits of SGD with random/single shuffling and provides new insights into its convergence properties for non-convex optimization.
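To make the three data orderings compared above concrete, the following is a minimal sketch and not the paper's algorithm or code. All function names (`with_replacement_order`, `random_reshuffling_order`, `single_shuffle_order`, `run_sgd`) and the toy per-sample gradient are hypothetical, introduced only for illustration. The sketch assumes a training set of `n` samples indexed `0..n-1` and shows that the schemes differ only in the index sequence fed to the same update rule.

```python
import random


def with_replacement_order(n, epochs, rng):
    """Classical SGD: each step draws an index uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n * epochs)]


def random_reshuffling_order(n, epochs, rng):
    """RR: draw a fresh permutation of 0..n-1 at the start of every epoch."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order


def single_shuffle_order(n, epochs, rng):
    """SS: draw one permutation once and reuse it in every epoch."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm * epochs


def run_sgd(order, grad, x0, lr):
    """Apply x <- x - lr * grad(x, i) for each index i in the given order."""
    x = x0
    for i in order:
        x = x - lr * grad(x, i)
    return x


if __name__ == "__main__":
    # Toy per-sample objective f_i(x) = 0.5 * (x - a_i)^2, chosen only to
    # exercise the orderings; it is convex and is NOT the paper's setting.
    data = [1.0, 2.0, 3.0, 4.0]

    def toy_grad(x, i):
        return x - data[i]

    rng = random.Random(0)
    schemes = (with_replacement_order, random_reshuffling_order, single_shuffle_order)
    for make_order in schemes:
        order = make_order(len(data), 3, rng)
        print(make_order.__name__, order, run_sgd(order, toy_grad, 0.0, 0.1))
```

Because `run_sgd` accepts any index sequence, the same loop also covers the "arbitrary data orderings" mentioned in the abstract: a different ordering is simply a different list passed as `order`.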