Gradient-based multilevel optimization (MLO) has gained attention as a framework for studying numerous problems, ranging from hyperparameter optimization and meta-learning to neural architecture search and reinforcement learning. However, gradients in MLO, which are obtained by composing best-response Jacobians via the chain rule, are notoriously difficult to implement and are memory- and compute-intensive. We take an initial step towards closing this gap by introducing Betty, a software library for large-scale MLO. At its core, we devise a novel dataflow graph for MLO, which allows us to (1) develop efficient automatic differentiation for MLO that reduces the computational complexity from O(d^3) to O(d^2), (2) incorporate systems support such as mixed-precision and data-parallel training for scalability, and (3) facilitate implementation of MLO programs of arbitrary complexity while allowing a modular interface for diverse algorithmic and systems design choices. We empirically demonstrate that Betty can be used to implement an array of MLO programs, and we observe up to an 11% increase in test accuracy, a 14% decrease in GPU memory usage, and a 20% decrease in training wall time over existing implementations on multiple benchmarks. We also show that Betty enables scaling MLO to models with hundreds of millions of parameters. We open-source the code at https://github.com/leopard-ai/betty.
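To make the phrase "composing best-response Jacobians via the chain rule" concrete, the following is a minimal sketch on a scalar two-level problem. It does not use Betty and is not taken from its code. The toy objectives, the function names, and the constants `a` and `b` are all assumptions chosen for illustration. The lower level solves w*(lam) = argmin_w 0.5*(w - a)^2 + 0.5*lam*w^2, the upper level evaluates L(w) = 0.5*(w - b)^2, and the upper-level gradient is dL/dlam = dL/dw * dw*/dlam, where the best-response Jacobian dw*/dlam comes from the implicit function theorem.

```python
# Toy bilevel problem (illustrative assumption, not part of Betty):
#   lower level: w*(lam) = argmin_w 0.5*(w - a)**2 + 0.5*lam*w**2
#   upper level: L(w)   = 0.5*(w - b)**2


def lower_solution(lam, a):
    """Closed-form lower-level minimizer: w* = a / (1 + lam)."""
    return a / (1.0 + lam)


def upper_loss(w, b):
    """Upper-level objective evaluated at the lower-level solution."""
    return 0.5 * (w - b) ** 2


def best_response_jacobian(lam, a):
    """dw*/dlam from the implicit function theorem.

    With f the lower-level objective,
    dw*/dlam = -(d2f/dw2)^-1 * d2f/(dw dlam), evaluated at w*.
    """
    w_star = lower_solution(lam, a)
    d2f_dw2 = 1.0 + lam   # second derivative of f with respect to w
    d2f_dwdlam = w_star   # mixed derivative of f at w*
    return -d2f_dwdlam / d2f_dw2


def hypergradient(lam, a, b):
    """Chain rule: dL/dlam = dL/dw * dw*/dlam."""
    w_star = lower_solution(lam, a)
    dL_dw = w_star - b
    return dL_dw * best_response_jacobian(lam, a)


def finite_difference(lam, a, b, eps=1e-6):
    """Central-difference check of the hypergradient."""
    hi = upper_loss(lower_solution(lam + eps, a), b)
    lo = upper_loss(lower_solution(lam - eps, a), b)
    return (hi - lo) / (2.0 * eps)


if __name__ == "__main__":
    lam, a, b = 0.5, 2.0, 1.0
    print("implicit-function hypergradient:", hypergradient(lam, a, b))
    print("finite-difference estimate:     ", finite_difference(lam, a, b))
```

In this scalar case every Jacobian is a single number. In a realistic problem with d parameters per level, the same quantities become d-by-d matrices and the inverse Hessian is typically approximated rather than formed, which is where the implementation difficulty and cost described above arise.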