We develop a framework for analyzing the training and learning-rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning-rate curves in terms of the deterministic solution of a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, the idealized exact line search can converge arbitrarily more slowly than SGD with the optimal fixed learning rate. Moreover, we exactly characterize the limiting learning rate (as time goes to infinity) of line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further show that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and we identify a phase transition when the eigenvalue density of the covariance follows a power-law distribution. Our code is available at https://github.com/amackenzie1/highline2024.
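The AdaGrad-Norm behavior described above can be illustrated with a minimal sketch (not the paper's exact setup): one-pass SGD on a noiseless least-squares problem, where the step size is a base rate divided by the square root of the accumulated squared gradient norms. All parameter choices below (dimension, spectrum, base rate, step count) are illustrative assumptions.

```python
# Minimal AdaGrad-Norm sketch on noiseless least squares; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d = 200
eigs = rng.uniform(0.5, 2.0, d)              # strictly positive spectrum of the data covariance K
sqrt_eigs = np.sqrt(eigs)
x_star = rng.standard_normal(d) / np.sqrt(d)  # noiseless target parameters
x = np.zeros(d)

eta, b2 = 1.0, 1.0                            # AdaGrad-Norm state: step size = eta / sqrt(b2)
for t in range(20_000):                       # one pass: a fresh sample a_t ~ N(0, K) each step
    a = sqrt_eigs * rng.standard_normal(d)
    g = (a @ x - a @ x_star) * a              # stochastic gradient of 0.5*(a.x - a.x_star)^2
    b2 += g @ g                               # accumulate squared gradient norms
    x -= (eta / np.sqrt(b2)) * g

risk = 0.5 * np.sum(eigs * (x - x_star) ** 2)  # population risk 0.5*(x - x_star)^T K (x - x_star)
lr = eta / np.sqrt(b2)
print(f"final risk: {risk:.2e}, final learning rate: {lr:.4f}")
```

In the noiseless setting the gradients vanish as the risk decays, so the accumulated norm stabilizes and the effective learning rate levels off at a positive constant rather than decaying to zero.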