Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model imposes a significant storage/memory burden on edge devices. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem that optimizes expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, which renders greedy methods ineffective. To tackle this issue, we propose a successive greedy decomposition method that decomposes the original problem into a series of subproblems, each solved via dynamic programming. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain an approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
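To illustrate the $K=1$ case, the sketch below shows a simplified cost-benefit greedy for monotone submodular maximization under a knapsack (storage) constraint. It is not the paper's exact algorithm: the `value` function, expert identifiers, and storage costs are placeholders for the latency-reduction utility of a cached expert set, and the $(1-1/e)$ bound stated above typically also requires partial enumeration, which is omitted here.

```python
# Illustrative sketch (not the paper's algorithm): cost-benefit greedy for
# monotone submodular maximization under a knapsack (storage) constraint.
# `experts`, `cost`, `budget`, and `value` are hypothetical placeholders.

def greedy_knapsack(experts, cost, budget, value):
    """Pick experts by marginal value per unit storage until the budget is exhausted.

    experts: iterable of expert identifiers
    cost:    dict mapping expert -> storage cost
    budget:  total storage budget of the edge server
    value:   monotone submodular set function, value(frozenset) -> float
    """
    chosen, remaining = set(), set(experts)
    spent = 0.0
    while remaining:
        base = value(frozenset(chosen))
        # Marginal gain per unit of storage for each remaining, feasible expert.
        ratios = {
            e: (value(frozenset(chosen | {e})) - base) / cost[e]
            for e in remaining
            if spent + cost[e] <= budget
        }
        if not ratios:
            break
        best = max(ratios, key=ratios.get)
        if ratios[best] <= 0:
            break
        chosen.add(best)
        spent += cost[best]
        remaining.discard(best)
    # Comparing against the best single feasible expert guards against the classic
    # failure case of pure ratio-greedy; the full (1 - 1/e) guarantee additionally
    # relies on partial enumeration, which is omitted in this sketch.
    singles = [e for e in experts if cost[e] <= budget]
    if singles:
        best_single = max(singles, key=lambda e: value(frozenset({e})))
        if value(frozenset({best_single})) > value(frozenset(chosen)):
            return {best_single}
    return chosen
```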