The emergence of Mixture-of-Experts (MoE) has transformed the scaling of large language models by enabling vast model capacity through sparse activation. Yet, converting these performance gains into practical edge deployment remains difficult, as the massive memory footprint and communication demands often overwhelm resource-limited environments. While centralized cloud-based solutions are available, they are frequently plagued by prohibitive infrastructure costs, latency issues, and privacy concerns. Moreover, existing edge-oriented optimizations largely overlook the complexities of heterogeneous hardware, focusing instead on isolated or uniform device setups. In response, this paper proposes Prism, an inference framework engineered for collaborative MoE serving across diverse GPU-equipped edge servers. By leveraging the intrinsic sparsity and input locality of MoE workloads, Prism minimizes inter-server communication and optimizes expert placement within diverse resource constraints. The framework integrates an activation-aware placement strategy that balances local request coverage with memory utilization, supplemented by a runtime migration mechanism to adapt expert distribution to dynamic workload changes. Experiments on contemporary MoE models and datasets demonstrate that Prism reduces inference latency by up to 30.6% and significantly lowers communication costs compared to state-of-the-art baselines, confirming the effectiveness of cooperative edge-based MoE serving.
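The activation-aware placement idea described above can be illustrated with a small sketch. This is a hypothetical greedy heuristic, not Prism's actual algorithm: each expert is assigned to the server whose local requests activate it most often, subject to that server's memory budget. The function name, input format, and the uniform-expert-size assumption are all illustrative.

```python
def place_experts(activation_counts, memory_budget, expert_size):
    """Greedy activation-aware expert placement (illustrative sketch).

    activation_counts: {server: {expert: count}} -- how often each
        server's local requests activate each expert.
    memory_budget: {server: bytes available for hosting experts}
    expert_size: bytes per expert (assumed uniform for simplicity)

    Returns {expert: server}, placing the hottest (server, expert)
    pairs first so local request coverage is maximized greedily.
    """
    # Flatten to (count, server, expert) and sort by descending count.
    triples = sorted(
        ((count, server, expert)
         for server, per_expert in activation_counts.items()
         for expert, count in per_expert.items()),
        reverse=True,
    )
    remaining = dict(memory_budget)
    placement = {}
    for count, server, expert in triples:
        if expert in placement:
            continue  # expert already hosted somewhere
        if remaining[server] >= expert_size:
            placement[expert] = server
            remaining[server] -= expert_size
    return placement
```

In practice a system like Prism would also weigh replication and runtime migration as workloads shift; this sketch only captures the static coverage-versus-memory trade-off that the placement strategy balances.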