Cyclic and randomized stepsizes are widely used in deep learning practice and can often outperform standard stepsize choices such as a constant stepsize in SGD. Despite their empirical success, little is currently known about when and why they can theoretically improve generalization performance. We consider a general class of Markovian stepsizes for learning, which contains i.i.d. random stepsizes, cyclic stepsizes, and the constant stepsize as special cases. Motivated by the literature showing that the heaviness of the tails of the SGD iterates (measured by the so-called "tail-index") is correlated with generalization, we study the tail-index and provide a number of theoretical results that demonstrate how the tail-index varies with the stepsize schedule. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to a constant stepsize in terms of the tail behavior. We illustrate our theory with linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve an even heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.
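To make the stepsize class concrete, the following is a minimal sketch and not the paper's implementation. It assumes the Markovian stepsize is driven by a finite-state Markov chain over two hypothetical stepsize values, so that the constant, cyclic, and i.i.d. rules arise as particular transition matrices. The 1-D least-squares setup, the function names, and the use of the Hill estimator as a rough tail-index diagnostic are all illustrative assumptions; the abstract does not specify how the tail-index is estimated.

```python
import math
import random


def markov_stepsizes(states, transition, n, rng, start=0):
    """Sample n stepsizes from a finite-state Markov chain (assumed setup).

    states[i] is the stepsize used in state i; transition[i][j] is the
    probability of moving from state i to state j.
    """
    idx = start
    out = []
    for _ in range(n):
        out.append(states[idx])
        idx = rng.choices(range(len(states)), weights=transition[idx])[0]
    return out


def sgd_1d_least_squares(stepsizes, rng, noise=1.0):
    """Run 1-D SGD on 0.5 * (a * x - y)**2 with fresh Gaussian (a, y) each step."""
    x = 0.0
    iterates = []
    for eta in stepsizes:
        a = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, noise)
        x -= eta * a * (a * x - y)  # stochastic gradient step
        iterates.append(x)
    return iterates


def hill_tail_index(samples, k):
    """Hill estimator of the tail index from the k largest absolute values.

    Assumes the (k+1)-th largest absolute value is strictly positive.
    """
    order = sorted((abs(s) for s in samples), reverse=True)
    threshold = order[k]
    mean_log = sum(math.log(v / threshold) for v in order[:k]) / k
    return 1.0 / mean_log


# Three special cases as transition matrices over two illustrative stepsizes.
STATES = [0.2, 0.6]
CONSTANT = [[1.0, 0.0], [0.0, 1.0]]  # never leaves the start state
CYCLIC = [[0.0, 1.0], [1.0, 0.0]]    # deterministic alternation
IID = [[0.5, 0.5], [0.5, 0.5]]       # next state independent of the current one

if __name__ == "__main__":
    rng = random.Random(0)
    for name, P in [("constant", CONSTANT), ("cyclic", CYCLIC), ("iid", IID)]:
        etas = markov_stepsizes(STATES, P, 20000, rng)
        xs = sgd_1d_least_squares(etas, rng)
        # Discard a burn-in prefix, then estimate the tail index.
        print(name, hill_tail_index(xs[1000:], k=200))
```

A smaller estimated tail index corresponds to a heavier tail. Because the iterates are dependent and the choice of `k` is ad hoc, the printed values should be read only as a qualitative comparison between schedules.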