Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case and extend the method to stochastic single-node training and multi-node settings. We also incorporate momentum into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M parameters) and fine-tuning a Swin Transformer (28M parameters). Across the considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields an approximately $1.5\times$ end-to-end speedup compared to runs with grid-searched stepsizes.
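For orientation, the baseline being revisited can be sketched in a few lines. This is the standard Sign-SGD update, $x_{t+1} = x_t - \gamma\,\mathrm{sign}(g_t)$, with a hand-picked stepsize $\gamma$; it is not the parameter-free method proposed here. The toy quadratic objective, its coefficients, and the names `sign_sgd_step` and `quadratic_grad` are illustrative assumptions.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def sign_sgd_step(params, grads, stepsize):
    """One plain Sign-SGD update: x <- x - stepsize * sign(g).

    Only the sign of each gradient coordinate is used, which is why
    the method doubles as a 1-bit-per-coordinate gradient compressor.
    """
    return [x - stepsize * sign(g) for x, g in zip(params, grads)]


def quadratic_grad(params):
    """Gradient of the toy objective f(x) = 0.5 * sum(a_i * x_i**2).

    The coefficients are hypothetical and chosen only for illustration.
    """
    coeffs = [1.0, 10.0]
    return [a * x for a, x in zip(coeffs, params)]


params = [1.0, -1.0]
stepsize = 0.25  # hand-picked; this is the quantity that normally needs tuning
for _ in range(4):
    params = sign_sgd_step(params, quadratic_grad(params), stepsize)
```

Every coordinate moves by exactly `stepsize` per iteration regardless of the gradient magnitude, so a good value depends on the distance to the solution and on problem constants that are unknown in advance. That dependence is the limitation a parameter-free variant is meant to remove.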