This paper investigates a class of stochastic bilevel optimization problems where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level problem is strongly convex. Such problems have significant applications in sequential data learning, for example text classification using recurrent neural networks. Unbounded smoothness here means that the smoothness constant of the upper-level function scales linearly with the gradient norm and admits no uniform upper bound. Existing state-of-the-art algorithms require $\widetilde{O}(1/\epsilon^4)$ oracle calls to stochastic gradient or Hessian/Jacobian-vector product oracles to find an $\epsilon$-stationary point. However, it remains unclear whether the convergence rate can be further improved when the assumptions made on the population-level function also hold almost surely for each random realization. To address this question, we propose a new Accelerated Bilevel Optimization algorithm named AccBO. The algorithm updates the upper-level variable by normalized stochastic gradient descent with recursive momentum, and the lower-level variable by the stochastic Nesterov accelerated gradient descent algorithm with averaging. We prove that our algorithm achieves an oracle complexity of $\widetilde{O}(1/\epsilon^3)$ for finding an $\epsilon$-stationary point when the lower-level stochastic gradient's variance is $O(\epsilon)$. Our proof relies on a novel lemma characterizing, with high probability, the dynamics of the stochastic Nesterov accelerated gradient descent algorithm for the lower-level variable under distribution drift; this lemma is of independent interest and also plays a crucial role in analyzing the hypergradient estimation error over time. Experimental results on various tasks confirm that our proposed algorithm achieves the predicted theoretical acceleration and significantly outperforms baselines in bilevel optimization.
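To make the two update rules concrete, below is a minimal NumPy sketch of AccBO-style updates on a toy quadratic lower-level problem. It is an illustrative assumption, not the paper's exact algorithm: the surrogate hypergradient `upper_grad`, the problem data, and all step-size and momentum constants are hypothetical choices made only to show the structure of the normalized recursive-momentum upper-level step and the Nesterov-with-averaging lower-level step.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_y = 5, 5
A = rng.standard_normal((dim_y, dim_x))  # toy coupling between levels

def upper_grad(x, y, xi):
    # Hypothetical stochastic hypergradient surrogate (illustrative only);
    # xi is the shared noise sample, reused at two query points.
    return x + y.mean() * np.ones_like(x) + 0.1 * xi

def lower_grad(x, y):
    # Stochastic gradient of a toy strongly convex lower-level objective
    # g(x, y) = 0.5 * ||y - A @ x||^2, plus Gaussian noise.
    return (y - A @ x) + 0.1 * rng.standard_normal(dim_y)

T = 200
gamma, beta = 0.01, 0.1  # upper-level step size and momentum weight (illustrative)
alpha, tau = 0.5, 0.9    # lower-level Nesterov step size and momentum (illustrative)

x = np.zeros(dim_x)
y = np.zeros(dim_y)
z = y.copy()             # Nesterov extrapolation point
y_avg = y.copy()         # running average of lower-level iterates
m = upper_grad(x, y, rng.standard_normal(dim_x))  # recursive momentum estimator

for t in range(1, T + 1):
    # Lower level: one stochastic Nesterov accelerated gradient step, then
    # update the running average that feeds the hypergradient estimator.
    y_new = z - alpha * lower_grad(x, z)
    z = y_new + tau * (y_new - y)
    y = y_new
    y_avg += (y - y_avg) / t

    # Upper level: normalized step along the momentum estimator, then a
    # STORM-style recursive momentum update using the same sample at both
    # the old and new upper-level iterates.
    x_new = x - gamma * m / (np.linalg.norm(m) + 1e-12)
    xi = rng.standard_normal(dim_x)
    m = upper_grad(x_new, y_avg, xi) + (1 - beta) * (m - upper_grad(x, y_avg, xi))
    x = x_new

print("||x|| =", np.linalg.norm(x),
      " lower-level residual =", np.linalg.norm(y_avg - A @ x))
```

In this sketch the normalization $m/\lVert m\rVert$ keeps the upper-level step length fixed regardless of the gradient magnitude, which is the mechanism that copes with unbounded smoothness, while the recursive momentum term $(1-\beta)(m - \text{old gradient})$ reduces the variance of the hypergradient estimate over time.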