MoE-DisCo: Low-Cost Training of Mixture-of-Experts Models (MoE-DisCo: Low Economy Cost Training Mixture-of-Experts Models)

Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), and the cost of this hardware has become a major barrier to large-model training. Affordable hardware is much cheaper, but its limited memory capacity and bandwidth make it unsuitable for training LLMs directly. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets using unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset on low-cost devices, with no inter-device communication. All experts are then integrated into a complete MoE model, which is fine-tuned globally for a short period on high-memory, high-bandwidth GPUs. Experiments show that our method matches or even surpasses full-parameter training on multiple downstream tasks as well as in training loss and perplexity (PPL), while reducing training cost by 47.6% to 69.5% on Qwen1.5-MoE-2.7B and Llama-MoE-3.5B across different datasets.
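The data-partitioning step described above can be illustrated with a minimal sketch. The abstract only says "unsupervised clustering", so the use of plain k-means, the toy 2-D sample embeddings, and the names `kmeans_partition` and `build_submodel_jobs` are all assumptions made for illustration and are not the authors' implementation. The sketch clusters the samples and produces one data shard per expert, which is what would let each dense submodel (shared backbone plus one expert) be trained on its own device without communication.

```python
import math


def kmeans_partition(points, k, iters=20):
    """Plain k-means (an assumed choice): return a cluster index per point."""
    # Deterministic init: take k points spread evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def build_submodel_jobs(points, k):
    """Map each hypothetical expert name to the sample indices in its shard."""
    assign = kmeans_partition(points, k)
    return {
        f"expert_{c}": [i for i, a in enumerate(assign) if a == c]
        for c in range(k)
    }


# Toy sample embeddings (hypothetical): two well-separated groups.
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
jobs = build_submodel_jobs(embeddings, k=2)
print(jobs)
```

Each entry of `jobs` would correspond to one independent training run on a low-cost device. Integrating the trained experts and the short global fine-tuning stage on high-memory GPUs are not shown here.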
