Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that offers fine-grained control while retaining convergence guarantees is still lacking. To bridge this gap, we propose \textbf{MGUP}, a novel mechanism for selective updates. \textbf{MGUP} augments standard momentum-based optimizers by applying larger step-sizes to a fixed proportion of parameters selected at each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly plug-and-play module, \textbf{MGUP} integrates with optimizers such as AdamW, Lion, and Muon, yielding variants such as \textbf{MGUP-AdamW}, \textbf{MGUP-Lion}, and \textbf{MGUP-Muon}. Under standard assumptions, we provide theoretical convergence guarantees for \textbf{MGUP-AdamW} (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that \textbf{MGUP}-enhanced optimizers achieve superior or more stable performance compared to their base optimizers. \textbf{MGUP} thus offers a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates that accelerates and stabilizes the training of large-scale models. The code is publicly available at https://github.com/MaeChd/MGUP.
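To make the two-step-size idea concrete, the following is a minimal NumPy sketch of a single momentum step in which a fixed fraction of entries receives the full step size and the remaining entries receive a smaller, non-zero one. It is an illustration under stated assumptions, not the released implementation: the abstract does not say how the proportion is selected, so the sketch assumes selection by momentum magnitude, and the names `mgup_style_step`, `ratio`, and `low_scale` are hypothetical.

```python
import numpy as np

def mgup_style_step(param, grad, m, lr=1e-3, beta=0.9, ratio=0.25, low_scale=0.1):
    """One momentum step with two step sizes (illustrative sketch only).

    ratio:     fraction of entries that receive the full step size `lr`
    low_scale: multiplier in (0, 1] giving the smaller, non-zero step for the rest
    """
    # Exponential moving average of gradients (plain momentum).
    m = beta * m + (1.0 - beta) * grad
    # ASSUMPTION: entries are ranked by momentum magnitude; the abstract
    # does not state the actual selection rule.
    score = np.abs(m).ravel()
    # Fixed number of entries selected at every iteration.
    k = max(1, int(round(ratio * score.size)))
    top = np.argpartition(score, -k)[-k:]
    # Every entry gets a small, non-zero step; selected entries get the full step.
    scale = np.full(score.size, low_scale)
    scale[top] = 1.0
    scale = scale.reshape(param.shape)
    new_param = param - lr * scale * m
    return new_param, m, scale

# Usage: one step on a toy 8-entry parameter vector.
param = np.zeros(8)
grad = np.arange(1.0, 9.0)
new_param, m, scale = mgup_style_step(param, grad, np.zeros(8), lr=1.0)
```

Because `low_scale` is strictly positive, no entry is ever frozen, which matches the abstract's description of smaller but non-zero step-sizes for the unselected parameters. In a setting like AdamW, the same mask would presumably scale the adaptive update rather than raw momentum, but that detail is not given in the abstract.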