Bilevel optimization, the problem of minimizing a value function that involves the arg-minimum of another function, appears in many areas of machine learning. In large-scale empirical risk minimization settings, where the number of samples is huge, it is crucial to develop stochastic methods, which use only a few samples at a time to make progress. However, computing the gradient of the value function involves solving a linear system, which makes it difficult to derive unbiased stochastic estimates. To overcome this problem we introduce a novel framework in which the solution of the inner problem, the solution of the linear system, and the main variable evolve at the same time. The update directions for these variables are written as sums, making it straightforward to derive unbiased stochastic estimates. The simplicity of our approach allows us to develop global variance reduction algorithms, where the dynamics of all variables are subject to variance reduction. We demonstrate that SABA, an adaptation of the celebrated SAGA algorithm to our framework, has a $O(1/T)$ convergence rate and achieves linear convergence under a Polyak-Łojasiewicz assumption. This is the first stochastic algorithm for bilevel optimization that satisfies either of these properties. Numerical experiments validate the usefulness of our method.
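To fix ideas, the structure described above can be written out as follows; the notation ($x$ for the main variable, $z$ for the inner variable, $v$ for the solution of the linear system) is one common convention for such problems and is not fixed by the abstract itself. The bilevel problem is
\[
\min_{x} \; h(x) = F\bigl(z^*(x), x\bigr) \quad \text{with} \quad z^*(x) \in \arg\min_{z} G(z, x),
\]
and implicit differentiation gives
\[
\nabla h(x) = \nabla_x F\bigl(z^*(x), x\bigr) + \nabla^2_{xz} G\bigl(z^*(x), x\bigr)\, v^*(x),
\qquad
\nabla^2_{zz} G\bigl(z^*(x), x\bigr)\, v^*(x) = -\nabla_z F\bigl(z^*(x), x\bigr),
\]
so evaluating $\nabla h$ requires solving a linear system in $v$. When $F$ and $G$ are finite sums over samples, each of the three update directions
\[
D_z = \nabla_z G(z, x), \qquad
D_v = \nabla^2_{zz} G(z, x)\, v + \nabla_z F(z, x), \qquad
D_x = \nabla^2_{xz} G(z, x)\, v + \nabla_x F(z, x)
\]
is itself a sum over samples, so sampling a single term yields an unbiased estimate of each direction, which is what makes variance reduction on all three variables possible at once.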