Gradient descent and its variants are the de facto standard algorithms for training machine learning models. Because gradient descent is sensitive to its hyperparameters, they must be tuned carefully, typically via grid search. However, grid search is time-consuming, particularly when multiple hyperparameters exist. Recent studies have therefore analyzed parameter-free methods that adjust the hyperparameters on the fly. However, existing work is limited to parameter-free methods for the stepsize; parameter-free methods for other hyperparameters remain unexplored. For instance, although the gradient clipping threshold is, alongside the stepsize, a crucial hyperparameter for preventing gradient explosion, no existing study has investigated parameter-free methods for clipped gradient descent. In this study, we therefore investigate parameter-free methods for clipped gradient descent. Specifically, we propose Inexact Polyak Stepsize, which converges to the optimal solution without any hyperparameter tuning, and whose convergence rate is asymptotically independent of $L$ under the $L$-smoothness and $(L_0, L_1)$-smoothness assumptions on the loss function, matching that of clipped gradient descent with well-tuned hyperparameters. We numerically validated our convergence results on a synthetic function and demonstrated the effectiveness of our proposed method using LSTM, Nano-GPT, and T5.
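To make the setting concrete, the sketch below combines the classical Polyak stepsize with gradient clipping on a toy one-dimensional quadratic. This is an illustrative sketch, not the proposed Inexact Polyak Stepsize: the classical Polyak stepsize assumes the optimal value $f^*$ is known exactly (here $f^* = 0$ by construction), whereas the proposed method avoids this requirement; the function names and clipping threshold are our own choices.

```python
def clipped_gd_polyak(f, grad, x0, f_star, clip=1.0, n_steps=100):
    """Gradient descent with the classical Polyak stepsize and a cap on
    the step length (gradient clipping). Illustrative sketch only."""
    x = x0
    for _ in range(n_steps):
        g = grad(x)
        if g == 0:  # already at a stationary point
            break
        # Classical Polyak stepsize: eta = (f(x) - f*) / |g|^2,
        # which requires knowing the optimal value f*.
        eta = (f(x) - f_star) / g ** 2
        # Clipping: shrink the stepsize so the step length |eta * g|
        # never exceeds the threshold `clip`.
        eta = min(eta, clip / abs(g))
        x = x - eta * g
    return x

# Toy 1-D quadratic f(x) = 0.5 x^2 with minimum f* = 0 at x = 0.
x_final = clipped_gd_polyak(f=lambda x: 0.5 * x * x,
                            grad=lambda x: x,
                            x0=10.0, f_star=0.0)
```

On this quadratic the iterate first moves in clipped steps of fixed length and then contracts geometrically once the unclipped Polyak step fits under the threshold, illustrating the two regimes that clipped gradient descent alternates between.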