Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks. Specifically, we decompose the MoE layer in LLMs by placing the gating network and the preceding neural network layer at the BS, while distributing the expert networks among the devices. This deployment leverages the parallel inference capabilities of expert networks on mobile devices while making effective use of their limited computing and caching resources. Accordingly, we develop a performance metric for WDMoE-based LLMs that accounts for both model capability and latency. To minimize latency while maintaining accuracy, we jointly optimize expert selection and bandwidth allocation based on this metric. Moreover, we build a hardware testbed using NVIDIA Jetson kits to validate the effectiveness of WDMoE. Both theoretical simulations and practical hardware experiments demonstrate that the proposed method significantly reduces latency without compromising LLM performance.
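To make the division of labor concrete, the sketch below illustrates one way the BS-side gating and a latency-aware expert-selection step could look in code. It is a minimal sketch under stated assumptions: the scoring rule, the per-device latency model, and names such as `latency_aware_select` and `alpha` are illustrative placeholders, not the paper's exact formulation.

```python
import numpy as np

# Minimal illustration: the gating network runs at the BS, each expert network
# resides on one mobile device, and expert selection trades off gating score
# against an estimated per-device latency. All numbers are placeholders.

rng = np.random.default_rng(0)

NUM_EXPERTS = 8          # one expert network per mobile device
HIDDEN_DIM = 16
TOP_K = 2

# Gating network (kept at the BS) producing per-expert routing weights.
W_gate = rng.normal(size=(HIDDEN_DIM, NUM_EXPERTS))

def gate(hidden):
    """Softmax gating scores for one token's hidden state."""
    logits = hidden @ W_gate
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Assumed per-device latency model: compute latency plus a transmission latency
# that shrinks as the allocated bandwidth grows (from the allocation step).
compute_latency = rng.uniform(5.0, 20.0, NUM_EXPERTS)   # ms
payload_bits = HIDDEN_DIM * 32                           # token activations
bandwidth = rng.uniform(1e6, 5e6, NUM_EXPERTS)           # bit/s, allocated per device
tx_latency = payload_bits / bandwidth * 1e3              # ms
total_latency = compute_latency + tx_latency

def latency_aware_select(scores, latency, k=TOP_K, alpha=0.01):
    """Pick k experts trading off gating score against estimated latency.
    The linear penalty weighted by 'alpha' is an assumption for this sketch."""
    utility = scores - alpha * latency
    return np.argsort(utility)[-k:]

hidden = rng.normal(size=HIDDEN_DIM)
scores = gate(hidden)
chosen = latency_aware_select(scores, total_latency)

# The chosen devices would run their experts in parallel; the BS then combines
# their outputs weighted by the renormalized gating scores.
weights = scores[chosen] / scores[chosen].sum()
print("selected experts:", chosen, "combining weights:", weights)
print("per-token latency bound (ms):", total_latency[chosen].max())
```

In this toy setup, the per-token latency is bounded by the slowest selected device, which is why expert selection and bandwidth allocation interact and are optimized jointly in the proposed scheme.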