The emergence of Mixture-of-Experts (MoE) has transformed the scaling of large language models by enabling vast model capacity through sparse activation. Yet converting these gains into practical edge deployment remains difficult, because the large memory footprint and communication demands of MoE models often overwhelm resource-limited environments. Centralized cloud-based solutions are available, but they often incur prohibitive infrastructure costs, added latency, and privacy concerns. Moreover, existing edge-oriented optimizations largely overlook hardware heterogeneity, focusing instead on isolated devices or uniform device setups. In response, this paper proposes Prism, an inference framework for collaborative MoE serving across heterogeneous GPU-equipped edge servers. By exploiting the intrinsic sparsity and input locality of MoE workloads, Prism minimizes inter-server communication and optimizes expert placement under each server's resource constraints. The framework integrates an activation-aware placement strategy that balances local request coverage against memory utilization, supplemented by a runtime migration mechanism that adapts the expert distribution to dynamic workload changes. Experiments on contemporary MoE models and datasets show that Prism reduces inference latency by up to 30.6% and significantly lowers communication costs compared to state-of-the-art baselines, confirming the effectiveness of cooperative edge-based MoE serving.