Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and exploiting curvature information. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to the Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework that generalizes steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, more advanced approaches such as Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent
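As a rough illustration of how the two ingredients could be composed, the following PyTorch sketch applies Adam-style moment preconditioning to the gradient and then a Muon-style Newton–Schulz orthogonalization. The function names (`newton_schulz_orthogonalize`, `muadam_step`) and the specific composition are assumptions made for exposition, not the exact $\texttt{MuAdam}$ update defined in the paper.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    # Approximately map M to the nearest (semi-)orthogonal matrix, i.e. U V^T from
    # its SVD, using the quintic Newton-Schulz iteration popularized by Muon.
    # (Muon's reference implementation also transposes tall matrices for efficiency.)
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from Muon's public implementation
    X = M / (M.norm() + eps)           # rescale so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muadam_step(W, G, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One MuAdam-style update on a weight matrix W with gradient G: Adam-style
    # exponential moments precondition the gradient elementwise, and the
    # preconditioned direction is then orthogonalized as in Muon.
    # (Bias correction is omitted for brevity.)
    m.mul_(beta1).add_(G, alpha=1 - beta1)         # first-moment estimate
    v.mul_(beta2).addcmul_(G, G, value=1 - beta2)  # second-moment estimate
    precond = m / (v.sqrt() + eps)                 # Adam-style preconditioned direction
    update = newton_schulz_orthogonalize(precond)  # Muon-style spectral step
    W.add_(update, alpha=-lr)
    return W, m, v

# Usage on a single weight matrix:
W = torch.randn(256, 128)
m, v = torch.zeros_like(W), torch.zeros_like(W)
G = torch.randn_like(W)  # in practice, the gradient from backpropagation
W, m, v = muadam_step(W, G, m, v)
```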