The Lion optimizer has been a promising competitor to AdamW for training large AI models, with advantages in memory, computation, and sample efficiency. In this paper, we introduce Distributed Lion, an adaptation of Lion for distributed training environments. Leveraging the sign operator in Lion, Distributed Lion only requires communicating binary or low-precision vectors between the workers and the central server, significantly reducing the communication cost. Our theoretical analysis confirms Distributed Lion's convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, Distributed Lion attains performance comparable to standard Lion or AdamW applied to aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. We also demonstrate that Distributed Lion offers a more favorable performance-bandwidth trade-off than existing efficient distributed methods such as deep gradient compression and ternary gradients.
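The abstract does not spell out the update rule, so the following is only a minimal sketch of how sign-based communication could look. It assumes the standard Lion update (a sign applied to an interpolation of momentum and gradient), that each worker keeps its own momentum, and that the server aggregates by majority vote. The names `Worker`, `majority_vote`, and `distributed_lion_step`, and the hyperparameter defaults, are hypothetical and chosen for illustration.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


class Worker:
    """One worker holding its own Lion momentum (assumed to be local state)."""

    def __init__(self, dim, beta1=0.9, beta2=0.99):
        self.m = [0.0] * dim
        self.beta1 = beta1
        self.beta2 = beta2

    def local_update(self, grad):
        # Lion-style update direction: sign of an interpolation between the
        # momentum and the fresh gradient. Only this sign vector is sent.
        delta = [sign(self.beta1 * m + (1 - self.beta1) * g)
                 for m, g in zip(self.m, grad)]
        # The momentum is refreshed locally and never communicated.
        self.m = [self.beta2 * m + (1 - self.beta2) * g
                  for m, g in zip(self.m, grad)]
        return delta


def majority_vote(deltas):
    """Server side (assumed rule): per-coordinate sign of the summed votes."""
    return [sign(sum(column)) for column in zip(*deltas)]


def distributed_lion_step(params, workers, grads, lr=0.1, weight_decay=0.0):
    """One synchronous step: workers send signs, the server votes, all apply."""
    deltas = [w.local_update(g) for w, g in zip(workers, grads)]
    agg = majority_vote(deltas)
    return [p - lr * (a + weight_decay * p) for p, a in zip(params, agg)]


# Usage: three workers, two parameters, one step from the origin.
workers = [Worker(dim=2) for _ in range(3)]
grads = [[0.5, -1.0], [0.2, -0.3], [-0.1, 0.4]]
params = distributed_lion_step([0.0, 0.0], workers, grads)
print(params)  # [-0.1, 0.1]
```

In this sketch each worker uploads one sign per parameter instead of a full-precision gradient entry, which is where the bandwidth saving would come from. A server that averages the sign vectors instead of voting would send back a low-precision vector rather than a binary one; that variant is not shown here.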