Optimizers that leverage the matrix structure of neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings Shampoo and Muon reduce to spectral descent, analogous to how Adam and Signum reduce to sign descent, their general relationship and relative data efficiency under controlled settings remain unclear. Through extensive experiments on language models, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam's advantage over Signum. We show that Shampoo's update applied to weight matrices can be decomposed into an adapted Muon update. Consistent with this, Shampoo's benefits can be attributed exclusively to its application to weight matrices, challenging interpretations that are agnostic to parameter shapes. This admits a new perspective that also avoids shortcomings of related interpretations based on variance adaptation and whitening: rather than enforcing semi-orthogonality as in spectral descent, Shampoo's updates are time-averaged semi-orthogonal in expectation.
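The reduction to spectral descent can be illustrated numerically. Below is a minimal NumPy sketch, under the simplifying assumption of a single gradient step with full-matrix preconditioners and no accumulation (not the authors' experimental setup): the idealized Muon update replaces the gradient by its nearest semi-orthogonal matrix `U V^T`, and the single-step Shampoo update `L^{-1/4} G R^{-1/4}` coincides with it.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))  # a stand-in gradient matrix

# Idealized Muon / spectral descent: project the gradient onto the
# nearest semi-orthogonal matrix, U V^T from the SVD of G.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
muon_update = U @ Vt

def inv_fourth_root(M, eps=1e-12):
    """Symmetric PSD inverse fourth root via eigendecomposition
    (eigenvalues clipped at eps for numerical stability)."""
    w, Q = np.linalg.eigh(M)
    w = np.clip(w, eps, None)
    return Q @ np.diag(w ** -0.25) @ Q.T

# Single-step Shampoo preconditioning with L = G G^T and R = G^T G.
# With only one gradient in the preconditioners, L^{-1/4} G R^{-1/4}
# equals U V^T, i.e. Shampoo reduces to spectral descent here.
L = G @ G.T
R = G.T @ G
shampoo_update = inv_fourth_root(L) @ G @ inv_fourth_root(R)

# Both updates are (numerically) semi-orthogonal, and they agree.
print(np.allclose(muon_update @ muon_update.T, np.eye(4), atol=1e-6))
print(np.allclose(muon_update, shampoo_update, atol=1e-5))
```

In training, the preconditioners accumulate over many steps, which is exactly why (per the abstract) Shampoo's updates are only semi-orthogonal in a time-averaged, in-expectation sense rather than exactly at every step.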