Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs rely heavily on cloud computing, which incurs long latency, high bandwidth costs, and privacy concerns. Edge computing promises to address these concerns by deploying LLMs on edge devices, closer to data sources. Some works leverage model quantization to shrink models to fit resource-constrained edge devices, but at the cost of accuracy; others adopt cloud-edge collaboration, which suffers from unstable network connections. In this work, we leverage collaborative edge computing to enable edge devices and cloud servers to jointly perform efficient LLM inference. We propose EdgeShard, a general framework that partitions an LLM into shards and deploys them on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design efficient dynamic programming algorithms to optimize inference latency and throughput, respectively. Experiments with Llama2 series models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.
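To illustrate the flavor of the joint device selection and model partition problem, the following is a minimal sketch (not the paper's actual algorithm) of a dynamic program that assigns each transformer layer to one of several heterogeneous devices so that total inference latency, counting both per-layer compute time and inter-device transfer cost, is minimized. The cost tables `compute` and `comm` are hypothetical inputs; in practice they would come from profiling the devices and network links.

```python
def partition(compute, comm):
    """Assign each layer to a device, minimizing compute + transfer latency.

    compute[l][d] -- time to run layer l on device d (hypothetical profile)
    comm[d1][d2] -- cost to move activations from device d1 to d2 (0 if same)
    Returns (best_latency, assignment), where assignment[l] is layer l's device.
    """
    L, D = len(compute), len(compute[0])
    INF = float("inf")
    # dp[l][d]: min latency to finish layers 0..l with layer l placed on d
    dp = [[INF] * D for _ in range(L)]
    choice = [[0] * D for _ in range(L)]  # predecessor device for backtracking
    for d in range(D):
        dp[0][d] = compute[0][d]
    for l in range(1, L):
        for d in range(D):
            for prev in range(D):
                cost = dp[l - 1][prev] + comm[prev][d] + compute[l][d]
                if cost < dp[l][d]:
                    dp[l][d] = cost
                    choice[l][d] = prev
    # Backtrack from the cheapest placement of the last layer.
    d = min(range(D), key=lambda x: dp[L - 1][x])
    best, assignment = dp[L - 1][d], [d]
    for l in range(L - 1, 0, -1):
        d = choice[l][d]
        assignment.append(d)
    assignment.reverse()
    return best, assignment
```

With cheap communication the optimum alternates layers across devices; with expensive communication it keeps all layers on one device, matching the intuition that partitioning only pays off when transfer cost is low relative to compute savings. The real framework must additionally handle device selection, memory limits, and throughput (pipelined) objectives, which this sketch omits.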