In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size finds a global minimum for any choice of batch size. However, the particular global minimum found depends on the batch size. In the full-batch setting, we show that the solution is dense (i.e., not sparse) and highly aligned with its initialized direction, so relatively little feature learning occurs. In contrast, for any batch size strictly smaller than the number of samples, SGD finds a global minimum that is sparse and nearly orthogonal to its initialization, so the randomness of stochastic gradients induces a qualitatively different type of "feature selection" in this setting. Moreover, if we measure the sharpness of a minimum by the trace of the Hessian, the minima found with full-batch gradient descent are flatter than those found with strictly smaller batch sizes, in contrast to previous works which suggest that large batches lead to sharper minima. To prove convergence of SGD with a constant step size, we introduce a tool from the theory of non-homogeneous random walks which may be of independent interest.
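The setting described above can be made concrete with a small numerical sketch. The abstract does not give the exact objective, so everything below is an assumption made for illustration: the model is taken to be x ↦ w·σ(wᵀx) with a single weight vector w, the loss is the mean squared reconstruction error, the orthogonal data are the standard basis vectors, and minibatches are drawn without replacement at each step. The names `loss_and_grad` and `train` are hypothetical helpers, not taken from the paper.

```python
import numpy as np


def loss_and_grad(w, X, relu=True):
    """Loss and gradient for an ASSUMED single-neuron autoencoder.

    Model (assumption): x_hat = act(w . x) * w, with act = ReLU or identity.
    Loss (assumption):  mean over the batch of 0.5 * ||x_hat - x||^2.
    """
    pre = X @ w                                   # pre-activations, shape (b,)
    act = np.maximum(pre, 0.0) if relu else pre   # activations, shape (b,)
    # Derivative of the activation (taken as 0 at pre == 0 for ReLU).
    gate = (pre > 0).astype(w.dtype) if relu else np.ones_like(pre)
    R = act[:, None] * w[None, :] - X             # residuals, shape (b, d)
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    # Per-sample gradient: act * r + act'(pre) * (w . r) * x
    grad = (act[:, None] * R + (gate * (R @ w))[:, None] * X).mean(axis=0)
    return loss, grad


def train(n=8, batch_size=8, lr=0.05, steps=2000, seed=0, relu=True):
    """Constant-step-size SGD; batch_size == n is full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    X = np.eye(n)                                 # assumed orthogonal data
    w0 = rng.normal(size=n) / np.sqrt(n)          # assumed random initialization
    w = w0.copy()
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        _, g = loss_and_grad(w, X[idx], relu=relu)
        w -= lr * g
    final_loss, _ = loss_and_grad(w, X, relu=relu)
    return w0, w, final_loss


if __name__ == "__main__":
    for b in (8, 1):
        w0, w, final_loss = train(batch_size=b)
        cos = w0 @ w / (np.linalg.norm(w0) * np.linalg.norm(w))
        print(f"batch_size={b} loss={final_loss:.4f} cos(w0, w)={cos:.3f}")
        print("  w =", np.round(w, 3))
```

Running the script prints, for the full-batch and single-sample cases, the final loss, the cosine between the initial and final weights, and the final weight vector, which are the quantities the abstract's density and alignment claims refer to. The step size, dimension, and number of steps are arbitrary choices, so the output of this sketch should not be read as a reproduction of the paper's results.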