Mixture-of-Experts (MoE) models have become a cornerstone for training and scaling large language models (LLMs), offering substantial gains in model capacity and efficiency through sparse expert activation. However, serving these models remains challenging in practice, particularly in resource-constrained edge environments, due to their large memory footprint and complex communication demands. While centralized cloud inference is common, it incurs high infrastructure costs and raises latency and privacy concerns. A few recent works on edge MoE inference propose memory-efficient strategies, but they typically focus on single-device or homogeneous setups. This paper presents DanceMoE, an efficient MoE inference framework that enables activation-aware expert placement across collaborative, heterogeneous, GPU-equipped edge servers. DanceMoE leverages the inherent sparsity of MoE models and workload locality to minimize cross-server communication and enable efficient expert placement under heterogeneous resource constraints. It introduces a data-driven, activation-aware placement algorithm that balances local coverage and memory usage across servers, alongside a lightweight migration mechanism that adapts expert assignments as workloads evolve. We evaluate DanceMoE on modern MoE models and widely used datasets, demonstrating up to 30.6\% lower inference latency and substantially reduced communication compared to state-of-the-art baselines, showcasing the effectiveness of collaborative edge-based MoE inference.
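To make the placement idea concrete, the sketch below shows one plausible greedy heuristic for activation-aware expert placement under per-server memory budgets. It is only an illustration under stated assumptions (uniform expert size, a single replica per expert, and hypothetical names such as `activation_counts`, `expert_size`, and `memory_budget`); the abstract does not specify DanceMoE's actual algorithm.

```python
# Illustrative sketch only: a minimal greedy heuristic capturing the idea of
# activation-aware expert placement under per-server memory budgets.
# All names here are hypothetical; this is not the DanceMoE algorithm itself.

from typing import Dict, List, Tuple


def greedy_activation_aware_placement(
    activation_counts: List[List[int]],  # activation_counts[e][s]: activations of expert e from requests at server s
    expert_size: int,                    # memory footprint of one expert (assumed uniform)
    memory_budget: List[int],            # per-server memory capacity
) -> Dict[int, int]:
    """Assign each expert to one server, preferring servers where it is
    activated most often (maximizing local coverage) while respecting
    each server's memory budget."""
    num_experts = len(activation_counts)
    num_servers = len(memory_budget)
    used = [0] * num_servers
    placement: Dict[int, int] = {}

    # Consider (expert, server) pairs in decreasing order of local activation
    # count, so the most locality-critical assignments are made first.
    pairs: List[Tuple[int, int, int]] = [
        (activation_counts[e][s], e, s)
        for e in range(num_experts)
        for s in range(num_servers)
    ]
    pairs.sort(reverse=True)

    for _count, e, s in pairs:
        if e in placement:
            continue  # expert already placed
        if used[s] + expert_size <= memory_budget[s]:
            placement[e] = s
            used[s] += expert_size

    return placement


if __name__ == "__main__":
    # Toy example: 4 experts, 2 heterogeneous servers with room for 2 experts each.
    counts = [[90, 10], [20, 80], [50, 50], [5, 95]]
    print(greedy_activation_aware_placement(counts, expert_size=1, memory_budget=[2, 2]))
```

In this toy setting the heuristic co-locates each expert with the server that activates it most, which is the intuition behind reducing cross-server communication; handling replication, heterogeneous expert sizes, and online migration would require the fuller treatment described in the paper.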