Training complex machine learning (ML) architectures requires a compute- and time-consuming process of selecting the right optimizer and tuning its hyperparameters. A new paradigm of learning optimizers from data has emerged as a better alternative to hand-designed ML optimizers. We propose the Mnemosyne optimizer, which uses Performers: implicit low-rank attention Transformers. It can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) generalizes better than the popular LSTM optimizer, (b) in particular, can successfully train Vision Transformers (ViTs) while being meta-trained on standard MLPs, and (c) can initialize optimizers for faster convergence in robotics applications. We believe that these results open the possibility of using Transformers to build foundational optimization models that can address the challenges of regular Transformer training. We complement our results with an extensive theoretical analysis of the compact associative memory used by Mnemosyne.