The sparse activation mechanism of mixture-of-experts (MoE) models empowers edge intelligence with improved training efficiency and reduced computational resource consumption. However, traditional token routing in distributed MoE training faces significant challenges in resource-constrained edge networks, which are characterized by heterogeneous computing capabilities and stochastic token arrivals; under these conditions it inevitably suffers from workload backlog, resource inefficiency, and performance degradation. To address this issue, we propose Stable-MoE, a novel Lyapunov-based token routing framework for distributed MoE training over resource-heterogeneous edge networks. Specifically, we formulate a stochastic optimization problem that maximizes both system throughput and gating consistency by optimizing the token routing strategy and computational resource allocation, while ensuring the long-term stability of both the token and energy queues at the edge devices. Using Lyapunov optimization, we transform the intractable long-term optimization problem into tractable per-slot subproblems, enabling online decisions on token routing and computation frequency utilization without knowledge of future system states. Experimental results on the SVHN and CIFAR-100 datasets demonstrate that Stable-MoE outperforms the baselines, with gains of at least 40% in system throughput and at least 5% in test accuracy.
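To make the per-slot decomposition concrete, the following is a minimal sketch of a generic Lyapunov drift-plus-penalty step, not the formulation used by Stable-MoE, which the abstract does not specify. Everything in it is an assumption made for illustration: the trade-off weight `V`, the top-1 routing rule, the discrete `freq_levels`, the cubic energy model `kappa * f**3 * tau`, the per-token cost `cycles_per_token`, and the per-slot `energy_budget`. Each token goes to the expert maximizing `V * gate_score - Q`, so a long token queue discourages further routing to that device, and each device picks the frequency minimizing `Z * energy - Q * service`, so a large virtual energy queue pushes it toward lower frequencies.

```python
import numpy as np


def route_tokens(gate_scores, Q, V):
    """Route each token to one expert (top-1, an assumption of this sketch).

    gate_scores: (T, N) gating scores for T tokens over N experts/devices.
    Q:           (N,) current token-queue backlogs.
    V:           hypothetical weight trading gating utility against backlog.
    """
    weights = V * gate_scores - Q[None, :]
    return np.argmax(weights, axis=1)


def pick_frequencies(Q, Z, freq_levels, tau, cycles_per_token, kappa):
    """Pick, per device, the frequency level minimizing Z*energy - Q*service."""
    service = freq_levels * tau / cycles_per_token   # tokens served per slot, (F,)
    energy = kappa * freq_levels ** 3 * tau          # assumed cubic energy model, (F,)
    cost = Z[:, None] * energy[None, :] - Q[:, None] * service[None, :]
    idx = np.argmin(cost, axis=1)
    return freq_levels[idx], service[idx], energy[idx]


def lyapunov_step(Q, Z, gate_scores, V, freq_levels, tau,
                  cycles_per_token, kappa, energy_budget):
    """One online slot: decide from current queues only, then update them."""
    n_devices = Q.shape[0]
    choice = route_tokens(gate_scores, Q, V)
    arrivals = np.bincount(choice, minlength=n_devices).astype(float)
    freqs, service, energy = pick_frequencies(
        Q, Z, freq_levels, tau, cycles_per_token, kappa
    )
    # Token queue: serve what the chosen frequency allows, then add new tokens.
    Q_next = np.maximum(Q - service, 0.0) + arrivals
    # Virtual energy queue: grows when a slot spends more than the budget.
    Z_next = np.maximum(Z + energy - energy_budget, 0.0)
    return Q_next, Z_next, choice, freqs
```

Calling `lyapunov_step` once per time slot with that slot's gating scores yields an online policy that uses only the current queue states. In this sketch, a larger `V` favors following the gate's preferences, while a smaller `V` favors keeping the token queues short.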