To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and the recently proposed Muon as a type of non-Euclidean gradient descent, and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model-based momentum (known as Momo). The new Momo variants of Muon are significantly more robust to hyperparameter tuning, and often achieve a better validation score. Thus, for new tasks where the optimal hyperparameters are not known, we advocate using Momo in combination with MuonMax to save on costly hyperparameter tuning.
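To make the layer-wise steepest descent viewpoint concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm. It performs one step in which each matrix-shaped layer moves in the spectral-norm steepest descent direction (computed here with an SVD as a stand-in for Muon's Newton–Schulz iteration), while other parameters fall back to a sign step. The function names, the SVD-based orthogonalization, and the sign fallback are assumptions made for illustration; how per-layer norms are aggregated and whether updates are normalized is precisely the design space explored in the paper.

```python
import numpy as np


def orthogonalize(g):
    """Return U V^T from the reduced SVD of the gradient matrix G.

    This is the steepest-descent direction for a matrix layer under the
    spectral norm; Muon approximates it with a Newton-Schulz iteration,
    but an exact SVD is used here for clarity.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def layerwise_steepest_descent_step(weights, grads, lr):
    """One hypothetical layer-wise non-Euclidean gradient step.

    Matrix parameters use the spectral-norm direction above; all other
    parameters use an elementwise sign step (steepest descent under the
    infinity norm). This is only a sketch of the per-layer choice of norm.
    """
    new_weights = []
    for w, g in zip(weights, grads):
        direction = orthogonalize(g) if g.ndim == 2 else np.sign(g)
        new_weights.append(w - lr * direction)
    return new_weights
```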