Prior work on stochastic gradient descent (SGD) often focuses on its success. In this work, we construct worst-case optimization problems showing that, outside the regimes that prior work typically assumes, SGD can exhibit many strange and potentially undesirable behaviors. Specifically, we construct landscapes and data distributions such that (1) SGD converges to local maxima, (2) SGD escapes saddle points arbitrarily slowly, (3) SGD prefers sharp minima over flat ones, and (4) AMSGrad converges to local maxima. We also realize these results in a minimal neural-network-like example. Our results highlight the importance of simultaneously analyzing minibatch sampling, discrete-time update rules, and realistic landscapes to understand the role of SGD in deep learning.
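To make claim (1) concrete, the following is a minimal sketch of how minibatch noise combined with a finite step size can drive SGD toward a local maximum. It is a hypothetical toy problem chosen for illustration and is not claimed to be the construction used in this work. The per-sample loss `x * w**2 / 2`, the two-point distribution of `x`, the learning rate `0.9`, and all function names are assumptions. The expected loss has negative curvature at `w = 0`, so the origin is a local maximum, yet the update `w <- (1 - lr * x) * w` contracts on average in log scale.

```python
import math
import random


def mean_curvature(p_pos=0.45):
    """E[x] for x = +1 with probability p_pos and x = -1 otherwise.

    The expected loss is E[x] * w**2 / 2, so a negative value means
    w = 0 is a local maximum of the expected loss.
    """
    return p_pos * 1.0 + (1.0 - p_pos) * (-1.0)


def log_growth_rate(lr=0.9, p_pos=0.45):
    """E[log|1 - lr * x|], the average per-step change of log|w|.

    A negative value means |w| shrinks toward 0 almost surely.
    """
    return p_pos * math.log(abs(1.0 - lr)) + (1.0 - p_pos) * math.log(abs(1.0 + lr))


def run_sgd(lr=0.9, p_pos=0.45, steps=200, w0=1.0, seed=0):
    """Run SGD with batch size 1 on the per-sample loss x * w**2 / 2."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        x = 1.0 if rng.random() < p_pos else -1.0
        w -= lr * x * w  # the gradient of x * w**2 / 2 with respect to w is x * w
    return w


if __name__ == "__main__":
    print("mean curvature E[x]:", mean_curvature())
    print("E[log|1 - lr*x|]:  ", log_growth_rate())
    print("final |w|:         ", abs(run_sgd()))
```

In this sketch, gradient flow on the expected loss would move away from `w = 0`, so the convergence comes entirely from the interaction between sampling noise and the discrete update. Shrinking the learning rate enough makes `log_growth_rate` positive again, in which case the iterate escapes the maximum as the continuous-time picture predicts.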