Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead through federated optimization, but they still train the entire model on every node and therefore remain constrained by GPU memory. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts on each node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, which avoids full-parameter transmission while still sharing knowledge efficiently. Because sparse expert updates limit how much data each expert sees, we introduce an expert-merging warm-up strategy in which experts exchange knowledge early in training to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM on 16 standalone 48GB GPUs over internet connections; it performs competitively with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
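The sparse synchronization idea can be illustrated with a toy sketch. This is not the SPES implementation: the ownership layout, the function names `owned_payload` and `synchronize`, and the merge rule (a plain average over the nodes that train a given expert) are all assumptions made for illustration, and experts are reduced to short lists of floats. The point is only that each node transmits the experts it trains rather than the full parameter set.

```python
from typing import Dict, List

Vector = List[float]


def owned_payload(local_experts: Dict[int, Vector], owned: List[int]) -> Dict[int, Vector]:
    """Return only the experts this node trains; this is all it would transmit."""
    return {e: list(local_experts[e]) for e in owned}


def synchronize(payloads: List[Dict[int, Vector]]) -> Dict[int, Vector]:
    """Merge received experts. ASSUMPTION: average over the nodes that sent each expert."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for payload in payloads:
        for expert_id, weights in payload.items():
            if expert_id not in sums:
                sums[expert_id] = [0.0] * len(weights)
                counts[expert_id] = 0
            sums[expert_id] = [s + w for s, w in zip(sums[expert_id], weights)]
            counts[expert_id] += 1
    return {e: [s / counts[e] for s in sums[e]] for e in sums}


# Hypothetical layout: 3 experts, 2 nodes, expert 1 is trained on both nodes.
ownership = {0: [0, 1], 1: [1, 2]}

# Each node's local copy after some local steps (values are made up).
node_experts = {
    0: {0: [1.0, 1.0], 1: [2.0, 2.0], 2: [0.0, 0.0]},
    1: {0: [0.0, 0.0], 1: [4.0, 4.0], 2: [3.0, 3.0]},
}

payloads = [owned_payload(node_experts[n], ownership[n]) for n in ownership]
merged = synchronize(payloads)

# Every node overwrites its local copy with the merged experts.
for n in node_experts:
    node_experts[n].update({e: list(w) for e, w in merged.items()})

# Experts sent per node versus experts in the model.
sent_per_node = [len(p) for p in payloads]
```

In this toy setup each node sends two of the three experts, and the stale local copies of experts a node does not train are simply replaced at synchronization time.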