Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), and their high cost has become a major barrier to large-model training. Affordable hardware, in contrast, is constrained by memory capacity and bandwidth, making it unsuitable for training LLMs directly. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets via unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset using low-cost devices, without any inter-device communication. All experts are then integrated into a complete MoE model and fine-tuned globally for a short period on high-memory, high-bandwidth GPUs. Experiments show that our method matches or even surpasses full-parameter training across multiple downstream tasks, as well as in training loss and perplexity (PPL), while reducing training cost by 47.6% to 69.5% on Qwen1.5-MoE-2.7B and Llama-MoE-3.5B across different datasets.
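To make the staged procedure concrete, the sketch below walks through the three stages on a toy setup: unsupervised clustering of the data, independent training of backbone-plus-single-expert dense submodels, and a short global fine-tuning pass after the experts are merged. It is a minimal illustration under assumed toy dimensions, a reconstruction loss, and a soft router; it is not the paper's implementation (the paper trains LLM submodels on real corpora, with each submodel placed on its own low-cost device).

```python
# Minimal sketch of the MoE-DisCo staged pipeline, NOT the authors' code:
# model sizes, the toy data, the reconstruction objective, and the soft router
# are illustrative assumptions made for a self-contained example.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

D, H, E = 32, 64, 4                                    # input dim, hidden dim, number of experts

backbone = nn.Sequential(nn.Linear(D, H), nn.ReLU())   # shared backbone
experts = [nn.Linear(H, D) for _ in range(E)]          # one expert per submodel

# Stage 0: unsupervised clustering partitions the training data into E subsets.
X = torch.randn(2048, D)                               # toy stand-in for the training corpus
labels = torch.from_numpy(KMeans(n_clusters=E, n_init=10).fit_predict(X.numpy()))

# Stage 1: train each dense submodel (backbone + one expert) on its own subset.
# In the paper each submodel runs in parallel on a separate low-cost device with
# no inter-device communication; here the runs are sequential for simplicity.
for e in range(E):
    submodel = nn.Sequential(backbone, experts[e])
    opt = torch.optim.Adam(submodel.parameters(), lr=1e-3)
    Xe = X[labels == e]
    for _ in range(50):                                # toy reconstruction objective
        loss = nn.functional.mse_loss(submodel(Xe), Xe)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: integrate all experts into one MoE model and fine-tune briefly.
class MoE(nn.Module):
    def __init__(self, backbone, experts):
        super().__init__()
        self.backbone = backbone
        self.experts = nn.ModuleList(experts)
        self.router = nn.Linear(H, len(experts))       # learned router (assumed form)

    def forward(self, x):
        h = self.backbone(x)                           # (N, H)
        w = torch.softmax(self.router(h), dim=-1)      # (N, E) routing weights
        y = torch.stack([exp(h) for exp in self.experts], dim=-1)  # (N, D, E)
        return (y * w.unsqueeze(1)).sum(dim=-1)        # (N, D)

moe = MoE(backbone, experts)
opt = torch.optim.Adam(moe.parameters(), lr=1e-4)
for _ in range(20):                                    # short global fine-tuning
    loss = nn.functional.mse_loss(moe(X), X)
    opt.zero_grad(); loss.backward(); opt.step()
```

In this toy version the backbone object is reused across the per-expert runs; in the framework described above, each device would hold its own copy of the backbone during Stage 1, and the global fine-tuning stage reconciles the assembled MoE model.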