Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging when the number of variables is large or when constraints are present. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with nonlinear and possibly nonconvex lower-level constraints. We also present a comprehensive convergence theory that addresses both the lower-level unconstrained and constrained cases and covers all inexact calculations of the adjoint gradient (also called hypergradient), such as the inexact solution of the lower-level problem, inexact computation of the adjoint formula (due to the inexact solution of the adjoint equation or use of a truncated Neumann series), and noisy estimates of the gradients, Hessians, and Jacobians involved. To promote the use of bilevel optimization in large-scale learning, we have developed new low-rank practical bilevel stochastic gradient methods (BSG-N-FD and BSG-1) that do not require second-order derivatives and, in the lower-level unconstrained case, dispense with any matrix-vector products.
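To make the adjoint gradient and the truncated Neumann series concrete, the following is a minimal sketch on a toy problem. It is not the BSG-N-FD or BSG-1 method described above, and it uses exact (noise-free) derivatives and explicit matrix-vector products. It assumes an unconstrained lower level with the hypothetical quadratic g(x, y) = 0.5 y^T A y - x^T B y (A symmetric positive definite) and the hypothetical upper-level objective f(x, y) = 0.5 ||y - c||^2 + 0.5 rho ||x||^2. For such a problem the standard adjoint formula is grad_x f - (d^2 g / dx dy) (d^2 g / dy^2)^{-1} grad_y f, and the inverse is approximated by eta * sum_{k<K} (I - eta A)^k. All names (A, B, c, rho, eta, and both function names) are illustrative assumptions.

```python
import numpy as np


def neumann_inverse_vector_product(A, v, eta, num_terms):
    """Approximate A^{-1} v by eta * sum_{k=0}^{num_terms-1} (I - eta*A)^k v.

    The series converges when the spectral radius of (I - eta*A) is below 1,
    e.g. for symmetric positive definite A and 0 < eta <= 1 / lambda_max(A).
    """
    term = np.array(v, dtype=float)   # current term (I - eta*A)^k v
    total = term.copy()               # running sum of the terms
    for _ in range(num_terms - 1):
        term = term - eta * (A @ term)
        total += term
    return eta * total


def hypergradient(x, A, B, c, rho, num_terms):
    """Adjoint gradient of F(x) = f(x, y(x)) for the toy problem (assumed setup).

    Lower level: y(x) = argmin_y 0.5*y^T A y - x^T B y, so A y(x) = B^T x.
    Upper level: f(x, y) = 0.5*||y - c||^2 + 0.5*rho*||x||^2.
    """
    # The toy lower-level problem is solved exactly; in practice this
    # solution would itself be inexact.
    y = np.linalg.solve(A, B.T @ x)

    grad_x_f = rho * x
    grad_y_f = y - c

    # Step size chosen so that the Neumann series converges for SPD A.
    eta = 1.0 / np.linalg.norm(A, 2)
    w = neumann_inverse_vector_product(A, grad_y_f, eta, num_terms)

    # Here d^2 g / dx dy = -B, so the adjoint formula reads
    # grad_x_f - (-B) @ A^{-1} grad_y_f = grad_x_f + B @ A^{-1} grad_y_f.
    return grad_x_f + B @ w
```

In this sketch, a small `num_terms` gives a cheap but biased hypergradient, while a larger value approaches the exact adjoint gradient; this truncation is one of the sources of inexactness that the convergence theory is stated to cover.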