Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, owing to their memory and approximation limitations, are ill-suited to the demands of large-scale applications. In this paper, we introduce $\textbf{F}$orward $\textbf{G}$radient $\textbf{U}$nrolling with $\textbf{F}$orward $\textbf{G}$radient, abbreviated as $(\textbf{FG})^2\textbf{U}$, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $(\text{FG})^2\text{U}$ circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems for substantial gains in computational efficiency. In practice, $(\text{FG})^2\text{U}$ and other methods can be strategically placed at different stages of the training process to form a more cost-effective two-phase paradigm. Furthermore, $(\text{FG})^2\text{U}$ is easy to implement within popular deep learning frameworks and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $(\text{FG})^2\text{U}$, complemented by extensive empirical evaluations that showcase its superior performance on diverse large-scale bi-level optimization tasks. Code is available at https://github.com/ShenQianli/FG2U.
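The unbiased forward-gradient primitive underlying the approach can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy example, not the authors' implementation: for a scalar objective $f$, sampling a random direction $v \sim \mathcal{N}(0, I)$ and forming $(\nabla f(x) \cdot v)\, v$ gives an unbiased estimate of $\nabla f(x)$, since $\mathbb{E}[v v^\top] = I$; the directional derivative $\nabla f(x) \cdot v$ is exactly what a forward-mode Jacobian-vector product computes without storing a reverse-mode graph. The toy objective and function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_grad(x):
    # Hypothetical toy objective f(x) = 0.5 * ||x||^2, whose true gradient is x.
    # In the real method, the directional derivative would come from a
    # forward-mode JVP through the unrolled inner optimization, not a
    # closed-form gradient.
    return x

def forward_gradient(x, num_samples=10_000):
    """Average many single-direction forward-gradient estimates of grad f(x)."""
    est = np.zeros_like(x)
    for _ in range(num_samples):
        v = rng.standard_normal(x.shape[0])  # random tangent direction
        jvp = f_grad(x) @ v                  # directional derivative: grad f . v
        est += jvp * v                       # unbiased estimate of grad f(x)
    return est / num_samples

x = np.array([1.0, -2.0, 3.0])
print(forward_gradient(x))  # close to the true gradient [1.0, -2.0, 3.0]
```

Because each sampled direction is evaluated independently, the averaging loop can be distributed across workers, which is the sense in which this estimator naturally supports parallel computation.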