Stochastic Gradient Descent (SGD) is widely used to train neural networks. Two popular variants are Random Reshuffling (RR), which cycles through a fresh random permutation of the training data in each epoch, and Single Shuffle (SS), which reuses a single permutation across all epochs. However, the convergence properties of these algorithms in the non-convex case are not fully understood. Existing results suggest that, in realistic training scenarios where the number of epochs is smaller than the training set size, RR may perform worse than SGD with replacement. In this paper, we analyze a general SGD algorithm that allows for arbitrary data orderings and show improved convergence rates for non-convex functions. Specifically, our analysis reveals that SGD with random or single shuffling is always at least as fast as classical SGD with replacement, regardless of the number of iterations. Overall, our study highlights the benefits of random and single shuffling and provides new insights into their convergence properties for non-convex optimization.
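As a rough illustration of the three data orderings discussed in the abstract, the following sketch generates the sample indices that SGD would visit under each scheme. It is not the paper's algorithm or analysis; the function name `index_stream`, the `scheme` labels, and the `seed` parameter are hypothetical names chosen for this example.

```python
import random


def index_stream(n, epochs, scheme, seed=0):
    """Yield sample indices for `epochs` passes over `n` training samples.

    scheme (hypothetical labels used only in this sketch):
      "rr"               - Random Reshuffling: new permutation every epoch
      "ss"               - Single Shuffle: one permutation reused every epoch
      "with_replacement" - classical SGD: each index drawn uniformly at random
    """
    if scheme not in ("rr", "ss", "with_replacement"):
        raise ValueError(f"unknown scheme: {scheme!r}")

    rng = random.Random(seed)
    perm = list(range(n))

    if scheme == "ss":
        rng.shuffle(perm)  # shuffle once, before the first epoch

    for _ in range(epochs):
        if scheme == "rr":
            rng.shuffle(perm)  # reshuffle at the start of every epoch
        for step in range(n):
            if scheme == "with_replacement":
                yield rng.randrange(n)  # independent uniform draw
            else:
                yield perm[step]  # next index in the current permutation


if __name__ == "__main__":
    for scheme in ("rr", "ss", "with_replacement"):
        print(scheme, list(index_stream(n=4, epochs=2, scheme=scheme)))
```

Under RR and SS every sample is visited exactly once per epoch, whereas sampling with replacement may repeat some samples and skip others within the same epoch. In a training loop, each yielded index would select the sample whose stochastic gradient is used for the next update.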