In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations, a technique inherited from its centralized predecessor Muon, and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results show that DeMuon outperforms other popular decentralized algorithms by a clear margin across different network topologies.
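To make the two ingredients named above concrete, the following is a minimal sketch, not the algorithm as specified in the paper. It assumes the classical cubic Newton-Schulz iteration (the polynomial actually used by Muon or DeMuon may differ), exact gradients instead of heavy-tailed stochastic ones, and one plausible ordering of the local move, gossip, and tracker updates. All names (`newton_schulz`, `mix`, `tracked_step`, the mixing matrix `W`, the quadratic `targets`) are hypothetical and chosen only for illustration.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def axpy(A, B, alpha):
    """Return A + alpha * B entrywise."""
    return [[a + alpha * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def newton_schulz(G, steps=20):
    """Approximately orthogonalize G via X <- 1.5 X - 0.5 X X^T X (assumed cubic form)."""
    fro = math.sqrt(sum(v * v for row in G for v in row)) + 1e-12
    X = [[v / fro for v in row] for row in G]  # scale singular values into [0, 1]
    for _ in range(steps):
        X = axpy([[1.5 * v for v in row] for row in X],
                 matmul(matmul(X, transpose(X)), X), -0.5)
    return X

def mix(W, Zs):
    """Gossip averaging: node i receives sum_j W[i][j] * Z_j."""
    out = []
    for w_row in W:
        acc = [[0.0] * len(Zs[0][0]) for _ in Zs[0]]
        for w, Z in zip(w_row, Zs):
            acc = axpy(acc, Z, w)
        out.append(acc)
    return out

def tracked_step(W, Xs, Ys, grads, grad_fns, lr):
    """One illustrative round: orthogonalized local move, gossip, tracker update."""
    Xs = mix(W, [axpy(X, newton_schulz(Y), -lr) for X, Y in zip(Xs, Ys)])
    new_grads = [g(X) for g, X in zip(grad_fns, Xs)]
    # Tracker: mix neighbours' trackers, then add the change in the local gradient.
    Ys = [axpy(axpy(Ym, gn, 1.0), go, -1.0)
          for Ym, gn, go in zip(mix(W, Ys), new_grads, grads)]
    return Xs, Ys, new_grads

def mean(Zs):
    """Entrywise average of a list of equally shaped matrices."""
    return [[sum(vals) / len(Zs) for vals in zip(*rows)] for rows in zip(*Zs)]

# Hypothetical setup: 3 nodes with heterogeneous quadratics
# f_i(X) = 0.5 * ||X - A_i||_F^2, so grad f_i(X) = X - A_i.
W = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]  # doubly stochastic
targets = [[[2.0, 0.0, 1.0], [0.0, 1.0, 3.0]],
           [[-1.0, 4.0, 0.0], [2.0, 0.0, 1.0]],
           [[0.0, 1.0, -2.0], [3.0, 1.0, 0.0]]]
grad_fns = [lambda X, A=A: axpy(X, A, -1.0) for A in targets]
Xs = [[[0.0] * 3 for _ in range(2)] for _ in range(3)]
grads = [g(X) for g, X in zip(grad_fns, Xs)]
Ys = [[row[:] for row in g] for g in grads]  # trackers start at the local gradients
for _ in range(25):
    Xs, Ys, grads = tracked_step(W, Xs, Ys, grads, grad_fns, lr=0.05)
```

Two properties are worth checking in this sketch. First, for a full-rank wide matrix the output `O` of `newton_schulz` has approximately orthonormal rows, that is, `O O^T` is close to the identity. Second, because `W` is doubly stochastic and the trackers are initialized at the local gradients, the average of the trackers equals the average of the current local gradients at every round. This is the mechanism by which gradient tracking lets each node follow the global gradient even though the local functions differ.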