Muon is an optimizer that computes updates using the polar factor of the momentum matrix and has shown strong empirical performance across a range of training settings. A key component of Muon is the Newton-Schulz iteration used to compute this polar factor. Although this avoids the cost of an exact singular value decomposition, it remains expensive in practice because it is applied at every optimization step. At the same time, the momentum matrix changes smoothly over training, suggesting strong temporal correlation in the corresponding polar factors. In this paper, we exploit this structure and propose CacheMuon, a temporal preconditioning method that reuses information from previous optimization steps to approximate the polar factor at the current step. This reduces redundant orthogonalization computation across iterations. We analyze CacheMuon as an inexact Muon update, with error controlled by fresh-solver error and cache staleness. Empirically, CacheMuon provides a controllable quality-efficiency frontier: conservative thresholds closely match fresh Muon on language-model and vision training while reducing orthogonalization FLOPs, whereas more aggressive thresholds yield larger arithmetic savings at the cost of modest validation-quality degradation.
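To make the two ingredients described above concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the classical cubic Newton-Schulz iteration (Muon implementations typically use a tuned polynomial with far fewer steps) and a hypothetical reuse rule: the cached polar factor is returned whenever the relative Frobenius drift of the momentum matrix since the last fresh solve is at most a threshold `tau`. The names `newton_schulz_polar`, `CachedPolar`, `tau`, and `fresh_solves` are illustrative and do not come from the source; the abstract does not specify CacheMuon's actual caching mechanism.

```python
import numpy as np


def newton_schulz_polar(M, num_iters=50):
    """Approximate the polar factor U V^T of M (where M = U S V^T).

    Uses the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X. This is an assumed stand-in for the
    tuned iteration used in practical Muon implementations.
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


class CachedPolar:
    """Hypothetical threshold-based cache for polar factors.

    Reuses the cached factor while the momentum matrix has drifted by at
    most `tau` (relative Frobenius norm) since the last fresh solve.
    """

    def __init__(self, tau):
        self.tau = tau
        self.cached_M = None   # momentum matrix at the last fresh solve
        self.cached_Q = None   # polar factor computed at that step
        self.fresh_solves = 0  # number of full Newton-Schulz solves so far

    def __call__(self, M):
        if self.cached_M is not None:
            drift = np.linalg.norm(M - self.cached_M) / (
                np.linalg.norm(self.cached_M) + 1e-12
            )
            if drift <= self.tau:
                # Cache hit: skip orthogonalization and accept staleness error.
                return self.cached_Q
        # Cache miss: run a fresh solve and refresh the cache.
        self.cached_M = M.copy()
        self.cached_Q = newton_schulz_polar(M)
        self.fresh_solves += 1
        return self.cached_Q
```

In this sketch, a small `tau` triggers fresh solves often and tracks the exact polar factor closely, while a larger `tau` skips more orthogonalization work at the price of a staler update. This mirrors, in simplified form, the conservative versus aggressive threshold trade-off the abstract describes.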