Learning to Generalize Provably in Learning to Optimize

Learning to optimize (L2O), which automates the design of optimizers through data-driven approaches, has gained increasing popularity. However, current L2O methods often generalize poorly in at least two respects: (i) optimizer generalization, or "generalizable learning of optimizers", meaning how well the L2O-learned optimizer lowers the loss function values of unseen optimizees; and (ii) optimizee generalization, or "learning to generalize", meaning the test performance of an optimizee (itself a machine learning model) trained by the optimizer, measured as accuracy on unseen data. While optimizer generalization has been studied recently, optimizee generalization has not been rigorously studied in the L2O context, and it is the focus of this paper. We first theoretically establish an implicit connection between the local entropy and the Hessian, and hence unify their roles in the handcrafted design of generalizable optimizers as equivalent metrics of the flatness of the loss landscape. We then propose to incorporate these two metrics as flatness-aware regularizers into the L2O framework in order to meta-train optimizers to learn to generalize, and theoretically show that such generalization ability can be learned during the L2O meta-training process and then transferred to the optimizee loss function. Extensive experiments consistently validate the effectiveness of our proposals, with substantially improved generalization on multiple sophisticated L2O models and diverse optimizees. Our code is available at: https://github.com/VITA-Group/Open-L2O/tree/main/Model_Free_L2O/L2O-Entropy.
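
As a rough illustration of why the local entropy and the Hessian can both serve as flatness metrics, the sketch below compares them on a toy quadratic loss f(x) = 0.5 x^T A x, whose Hessian is A. It assumes the Entropy-SGD-style definition of local entropy, log ∫ exp(-f(x') - (gamma/2) ||x - x'||^2) dx', which has a closed form at the minimum of a quadratic. The function names, the matrices `sharp` and `flat`, and the value of `gamma` are hypothetical choices for this example and are not taken from the paper or its code.

```python
import numpy as np

def hessian_flatness(A):
    """Largest Hessian eigenvalue of f(x) = 0.5 * x^T A x (smaller means flatter)."""
    return float(np.linalg.eigvalsh(A).max())

def local_entropy_at_minimum(A, gamma):
    """Local entropy at the minimum x = 0 of f(x) = 0.5 * x^T A x.

    Assumed definition (Entropy-SGD style):
        log ∫ exp(-f(x') - gamma/2 * ||x'||^2) dx'
    The integrand is an unnormalized Gaussian with precision A + gamma * I, so
        value = d/2 * log(2*pi) - 0.5 * log det(A + gamma * I).
    """
    d = A.shape[0]
    _, logdet = np.linalg.slogdet(A + gamma * np.eye(d))
    return float(0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet)

# Two toy minima: one sharp (large curvature), one flat (small curvature).
sharp = np.diag([10.0, 10.0])
flat = np.diag([0.1, 0.1])
gamma = 1.0  # hypothetical coupling strength

for name, A in [("sharp", sharp), ("flat", flat)]:
    print(name,
          "max Hessian eigenvalue:", hessian_flatness(A),
          "local entropy:", local_entropy_at_minimum(A, gamma))
```

In this toy case the flat minimum has the smaller Hessian eigenvalue and the larger local entropy, so the two metrics rank the minima consistently. This is only a quadratic special case, not the general result the paper establishes.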
