This paper investigates a class of stochastic bilevel optimization problems where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level problem is strongly convex. These problems have significant applications in sequential data learning, such as text classification using recurrent neural networks. The unbounded smoothness is characterized by the smoothness constant of the upper-level function scaling linearly with the gradient norm, rather than being bounded by a uniform constant. Existing state-of-the-art algorithms require $\widetilde{O}(1/\epsilon^4)$ oracle calls to stochastic gradient or Hessian/Jacobian-vector product oracles to find an $\epsilon$-stationary point. However, it remains unclear whether we can further improve the convergence rate when the assumptions on the function at the population level also hold for each random realization almost surely (e.g., Lipschitzness of each realization of the stochastic gradient). To address this issue, we propose a new Accelerated Bilevel Optimization algorithm named AccBO. The algorithm updates the upper-level variable by normalized stochastic gradient descent with recursive momentum, and the lower-level variable by the stochastic Nesterov accelerated gradient descent algorithm with averaging. We prove that our algorithm achieves an oracle complexity of $\widetilde{O}(1/\epsilon^3)$ to find an $\epsilon$-stationary point. Our proof relies on a novel lemma characterizing, with high probability, the dynamics of the stochastic Nesterov accelerated gradient descent algorithm under distribution drift for the lower-level variable; this lemma is of independent interest and also plays a crucial role in analyzing the hypergradient estimation error over time. Experimental results on various tasks confirm that our proposed algorithm achieves the predicted theoretical acceleration and significantly outperforms baselines in bilevel optimization.
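The two-timescale update structure described above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' exact AccBO algorithm: the toy quadratic problem, the stand-in stochastic oracles `grad_upper` and `grad_lower`, and all hyperparameter values (`eta`, `beta`, `gamma`, `alpha`) are assumptions made for illustration. The shared noise sample `xi` mimics evaluating the same stochastic gradient at two consecutive points, as recursive-momentum (STORM-style) estimators require.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.diag(np.linspace(1.0, 3.0, d))  # toy strongly convex lower-level curvature

# Stand-in stochastic oracles (illustrative assumptions, not the paper's setup).
def grad_upper(x, y, xi):
    # surrogate stochastic hypergradient estimate
    return x + 0.5 * y + 0.01 * xi

def grad_lower(x, y, xi):
    # lower-level stochastic gradient; its minimizer is y*(x) = -A^{-1} x
    return A @ y + x + 0.01 * xi

eta, beta, gamma, alpha = 0.01, 0.1, 0.05, 0.5  # assumed step sizes / momentum

x = np.ones(d)
y = np.zeros(d)
z = np.zeros(d)                  # Nesterov auxiliary sequence
y_avg = np.zeros(d)              # averaged lower-level iterate
x_prev, y_avg_prev = x.copy(), y_avg.copy()
m = grad_upper(x, y_avg, rng.normal(size=d))  # recursive momentum estimator

for t in range(1, 501):
    # Lower level: stochastic Nesterov acceleration with iterate averaging.
    xi = rng.normal(size=d)
    y_look = y + alpha * (y - z)                       # extrapolation step
    z, y = y, y_look - gamma * grad_lower(x, y_look, xi)
    y_avg = (t * y_avg + y) / (t + 1)

    # Upper level: recursive momentum, then a *normalized* gradient step,
    # so the step size does not rely on a uniform smoothness bound.
    xi = rng.normal(size=d)
    g_new = grad_upper(x, y_avg, xi)
    g_old = grad_upper(x_prev, y_avg_prev, xi)         # same sample, previous point
    m = g_new + (1.0 - beta) * (m - g_old)
    x_prev, y_avg_prev = x.copy(), y_avg.copy()
    x = x - eta * m / (np.linalg.norm(m) + 1e-12)
```

The normalized upper-level step is the key device for the unbounded-smoothness regime: the iterate moves a fixed distance `eta` per step regardless of how large the gradient (and hence the local smoothness constant) becomes.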