The graduated optimization approach is a heuristic method for finding globally optimal solutions of nonconvex functions and has been theoretically analyzed in several studies. This paper defines a new family of nonconvex functions for graduated optimization, discusses sufficient conditions for a function to belong to this family, and provides a convergence analysis of the graduated optimization algorithm for such functions. It shows that stochastic gradient descent (SGD) with mini-batch stochastic gradients has the effect of smoothing the objective function, and that the degree of smoothing is determined by the learning rate and batch size. This finding provides theoretical insights into why training with large batch sizes tends to fall into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates and batch sizes, and what the optimal learning rate schedule is. To the best of our knowledge, this is the first paper to provide a theoretical explanation for these aspects. Moreover, a new graduated optimization framework that uses a decaying learning rate and increasing batch size is analyzed, and experimental results on image classification that support our theoretical findings are reported.
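To make the idea of graduated optimization concrete, the following is a minimal sketch on a one-dimensional toy problem, not the algorithm analyzed in this paper. It assumes a hypothetical objective f(x) = x^2 + A(1 - cos(Bx)) and uses explicit Gaussian smoothing, whose gradient happens to have a closed form for this particular function. The constants `A` and `B`, the smoothing schedule `deltas`, and all function names are illustrative assumptions. The sketch does not model how SGD's learning rate and batch size set the degree of smoothing; it only shows why optimizing a sequence of progressively less smoothed functions can avoid a local minimum that traps plain gradient descent.

```python
import math

# Hypothetical toy constants (assumptions, not taken from the paper).
A, B = 2.0, 5.0


def f(x):
    """Nonconvex toy objective with its global minimum at x = 0."""
    return x ** 2 + A * (1.0 - math.cos(B * x))


def smoothed_grad(x, delta):
    """Gradient of the Gaussian-smoothed objective E_u[f(x - delta * u)], u ~ N(0, 1).

    For this f, smoothing damps the cosine term by exp(-(B * delta)^2 / 2),
    so the gradient is available in closed form. delta = 0 gives the
    gradient of the original function.
    """
    damp = math.exp(-0.5 * (B * delta) ** 2)
    return 2.0 * x + A * B * damp * math.sin(B * x)


def gradient_descent(x, delta, lr=0.01, steps=500):
    """Plain gradient descent on the objective smoothed with level delta."""
    for _ in range(steps):
        x -= lr * smoothed_grad(x, delta)
    return x


def graduated_optimization(x0, deltas=(1.0, 0.5, 0.25, 0.0)):
    """Minimize a sequence of less and less smoothed objectives.

    Each stage starts from the solution of the previous, smoother stage.
    The final stage (delta = 0) optimizes the original function.
    """
    x = x0
    for delta in deltas:
        x = gradient_descent(x, delta)
    return x


x_plain = gradient_descent(3.0, delta=0.0)  # no smoothing at any point
x_graduated = graduated_optimization(3.0)   # smoothing reduced in stages
print(f"plain GD:  x = {x_plain:.4f}, f(x) = {f(x_plain):.4f}")
print(f"graduated: x = {x_graduated:.4f}, f(x) = {f(x_graduated):.4f}")
```

Starting from the same point x = 3, plain gradient descent stalls in a local minimum near x ≈ 2.4, while the graduated run ends close to the global minimum at x = 0. In the setting described above, a decaying learning rate and an increasing batch size would play the role of the shrinking `deltas` schedule; that correspondence is stated here only as an analogy.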