The sparse activation mechanism of mixture-of-experts (MoE) models empowers edge intelligence with enhanced training efficiency and reduced computational resource consumption. However, traditional token routing in distributed MoE training faces significant challenges in resource-constrained edge networks characterized by heterogeneous computing capabilities and stochastic token arrivals, and inevitably suffers from workload backlog, resource inefficiency, and performance degradation. To address these issues, we propose Stable-MoE, a novel Lyapunov-based token routing framework for distributed MoE training over resource-heterogeneous edge networks. Specifically, we formulate a stochastic optimization problem that maximizes both system throughput and gating consistency by jointly optimizing the token routing strategy and the computational resource allocation, while ensuring the long-term stability of both the token and energy queues at the edge devices. Using Lyapunov optimization, we transform this intractable long-term problem into tractable per-slot subproblems, enabling online decisions on token routing and computation frequency without knowledge of future system states. Experimental results on the SVHN and CIFAR-100 datasets demonstrate that Stable-MoE outperforms the baselines with gains of at least 40% in system throughput and 5% in test accuracy.
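To make the drift-plus-penalty idea behind this transformation concrete, the following is a minimal, purely illustrative Python sketch. It is not the paper's actual formulation: the device names, the toy convex energy model, the arrival distribution, and the scoring rule are all assumptions introduced here. It only shows the general pattern of Lyapunov-based per-slot routing: each edge device maintains a token queue and a virtual energy queue, and each slot the router greedily picks the device minimizing a weighted sum of queue drift minus `V` times served throughput, with no knowledge of future arrivals.

```python
import random

V = 10.0  # penalty weight trading off throughput against queue stability


class EdgeDevice:
    """Hypothetical edge device with a token queue and a virtual energy queue."""

    def __init__(self, name, freq):
        self.name = name
        self.freq = freq       # compute capability: tokens served per slot
        self.token_q = 0.0     # backlog of routed tokens
        self.energy_q = 0.0    # virtual queue enforcing a long-term energy budget

    def score(self, tokens):
        # Drift-plus-penalty score for routing `tokens` here this slot:
        # backlog terms approximate the Lyapunov drift, -V * served rewards throughput.
        served = min(self.token_q + tokens, self.freq)
        energy = 0.01 * self.freq ** 2  # toy convex energy-per-slot model (assumption)
        return self.token_q * tokens + self.energy_q * energy - V * served

    def step(self, tokens, energy_budget):
        # Standard queue updates: Q(t+1) = max(Q(t) + arrivals - service, 0).
        served = min(self.token_q + tokens, self.freq)
        energy = 0.01 * self.freq ** 2
        self.token_q = max(self.token_q + tokens - served, 0.0)
        self.energy_q = max(self.energy_q + energy - energy_budget, 0.0)
        return served


def route(devices, tokens):
    # Per-slot greedy decision: needs only current queue states, not future arrivals.
    return min(devices, key=lambda d: d.score(tokens))


random.seed(0)
devices = [EdgeDevice("fast", 8.0), EdgeDevice("slow", 3.0)]
total_served = 0.0
for _ in range(100):
    arrivals = random.uniform(0, 10)  # stochastic token arrivals per slot
    target = route(devices, arrivals)
    for d in devices:
        total_served += d.step(arrivals if d is target else 0.0, energy_budget=0.5)

print(round(total_served, 1))
```

As the energy queue of the fast device grows, its score worsens and traffic shifts toward the slower device, which is the queue-stability behavior the Lyapunov framework is designed to produce.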