Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks. Specifically, we decompose the MoE layer in LLMs by placing the gating network and the preceding neural network layer at the BS, while distributing the expert networks among the devices. This deployment leverages the parallel inference capabilities of expert networks on mobile devices while making effective use of their limited computing and caching resources. Accordingly, we develop a performance metric for WDMoE-based LLMs that accounts for both model capability and latency. To minimize latency while maintaining accuracy, we jointly optimize expert selection and bandwidth allocation based on this metric. Moreover, we build a hardware testbed using NVIDIA Jetson kits to validate the effectiveness of WDMoE. Both theoretical simulations and practical hardware experiments demonstrate that the proposed method significantly reduces latency without compromising LLM performance.
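To make the division of labor concrete, the sketch below illustrates one way the BS-side gating and a latency-aware expert-selection step could look in code. It is a minimal sketch under stated assumptions: the scoring rule, the per-device latency model, and names such as `latency_aware_select` and `alpha` are illustrative placeholders, not the paper's exact formulation.

```python
import numpy as np

# Minimal illustration: the gating network runs at the BS, each expert network
# resides on one mobile device, and expert selection trades off gating score
# against an estimated per-device latency. All numbers are placeholders.

rng = np.random.default_rng(0)

NUM_EXPERTS = 8          # one expert network per mobile device
HIDDEN_DIM = 16
TOP_K = 2

# Gating network (kept at the BS) producing per-expert routing weights.
W_gate = rng.normal(size=(HIDDEN_DIM, NUM_EXPERTS))

def gate(hidden):
    """Softmax gating scores for one token's hidden state."""
    logits = hidden @ W_gate
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Assumed per-device latency model: compute latency plus a transmission latency
# that shrinks as the allocated bandwidth grows (from the allocation step).
compute_latency = rng.uniform(5.0, 20.0, NUM_EXPERTS)   # ms
payload_bits = HIDDEN_DIM * 32                           # token activations
bandwidth = rng.uniform(1e6, 5e6, NUM_EXPERTS)           # bit/s, allocated per device
tx_latency = payload_bits / bandwidth * 1e3              # ms
total_latency = compute_latency + tx_latency

def latency_aware_select(scores, latency, k=TOP_K, alpha=0.01):
    """Pick k experts trading off gating score against estimated latency.
    The linear penalty weighted by 'alpha' is an assumption for this sketch."""
    utility = scores - alpha * latency
    return np.argsort(utility)[-k:]

hidden = rng.normal(size=HIDDEN_DIM)
scores = gate(hidden)
chosen = latency_aware_select(scores, total_latency)

# The chosen devices would run their experts in parallel; the BS then combines
# their outputs weighted by the renormalized gating scores.
weights = scores[chosen] / scores[chosen].sum()
print("selected experts:", chosen, "combining weights:", weights)
print("per-token latency bound (ms):", total_latency[chosen].max())
```

In this toy setup, the per-token latency is bounded by the slowest selected device, which is why expert selection and bandwidth allocation interact and are optimized jointly in the proposed scheme.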