Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss, which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore, for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that covers the non-smooth, smooth, weakly convex, and strongly convex settings.
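To make the contrast between the two lower bounds concrete, the following is a minimal sketch, not the paper's reference implementation. It assumes a squared l2 regularizer (lam/2)*||x||^2 and assumes that the proximal update minimizes a truncated linear model of the loss, max(loss + <grad, y - x>, loss_lower_bound), plus the regularizer and the proximity term ||y - x||^2 / (2*alpha). Under these assumptions the minimizer has the closed form coded below. The function names `sps_step` and `prox_sps_l2_step` and all parameter names are hypothetical.

```python
import numpy as np


def sps_step(x, obj, grad, obj_lower_bound, alpha_max):
    """Capped Polyak step on the full objective (loss + regularizer).

    Needs a lower bound of the whole objective, which can be loose.
    """
    g2 = float(grad @ grad)
    if g2 == 0.0:
        return x.copy()
    step = min(alpha_max, max(obj - obj_lower_bound, 0.0) / g2)
    return x - step * grad


def prox_sps_l2_step(x, loss, grad, loss_lower_bound, alpha, lam):
    """Sketch of a proximal Polyak step for the regularizer (lam/2)*||x||^2.

    Assumption: the new iterate minimizes over y
        max(loss + <grad, y - x>, loss_lower_bound)
        + (lam / 2) * ||y||^2 + ||y - x||^2 / (2 * alpha).
    Only a lower bound of the loss is needed (often 0 for non-negative losses).
    """
    shrink = 1.0 + alpha * lam
    g2 = float(grad @ grad)
    if g2 == 0.0:
        # No gradient information: only the regularizer shrinks the iterate.
        return x / shrink
    numer = shrink * (loss - loss_lower_bound) - alpha * lam * float(grad @ x)
    step = min(alpha, max(numer, 0.0) / g2)
    return (x - step * grad) / shrink


if __name__ == "__main__":
    # Toy values chosen for illustration only.
    x = np.array([1.0, -2.0])
    grad = np.array([0.5, 1.0])
    y = prox_sps_l2_step(x, loss=2.0, grad=grad, loss_lower_bound=0.0,
                         alpha=10.0, lam=0.1)
    print(y)
```

With `lam = 0` the proximal sketch reduces to the capped Polyak step applied to the loss alone. With `lam > 0` the regularizer enters only through the shrinkage factor `1 + alpha * lam` and the correction term in `numer`, so no lower bound of the sum is required.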