Large Language Models (LLMs) have achieved remarkable success in serving end-users with human-like intelligence. However, LLMs demand substantial computational resources, making it challenging to deploy them under diverse performance objectives, such as meeting the resource constraints of edge devices close to end-users or achieving high accuracy when ample resources are available. In this paper, we introduce CE-CoLLM, a novel cloud-edge collaboration framework that supports efficient and adaptive LLM inference for end-users at the edge with two modes: (1) low-latency edge standalone inference and (2) highly accurate cloud-edge collaborative inference. First, we show that the inherently high communication cost of transmitting LLM contextual information between the edge and the cloud dominates the overall latency, making it inefficient and costly to deploy LLMs via naive cloud-edge collaboration. Second, we propose several critical techniques to address this challenge, including an early-exit mechanism, a cloud context manager, and quantization for cloud-edge collaboration, enabling not only low-latency standalone edge inference but also efficient and adaptive cloud-edge collaborative inference for LLMs. Third, we perform a comprehensive experimental analysis, which demonstrates that CE-CoLLM reduces inference time by up to 13.81% and cloud computation costs by up to 84.55% compared to popular cloud-based LLM deployment, while maintaining comparable model accuracy. The proposed approach effectively shifts computational load to the edge, reduces communication overhead, scales efficiently to multiple edge clients, and provides reliable LLM deployment using cloud-edge collaboration.
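The two-mode design can be pictured as a per-token confidence gate at the edge. The following is a minimal sketch, not the authors' implementation: all names (`generate_token`, `edge_head`, `cloud_infer`, `CONFIDENCE_THRESHOLD`) are illustrative assumptions about how an early-exit decision between edge standalone and cloud-edge collaborative inference might look.

```python
import torch

# Hypothetical illustration of an early-exit gate in cloud-edge
# collaborative LLM inference. Names and the threshold value are
# assumptions for exposition, not CE-CoLLM's actual API.

CONFIDENCE_THRESHOLD = 0.9  # assumed tunable confidence bound


def generate_token(hidden_states, edge_head, cloud_infer):
    """Decide per token whether the edge's partial forward pass suffices."""
    logits = edge_head(hidden_states)        # early-exit classifier on the edge
    probs = torch.softmax(logits, dim=-1)
    confidence, token = probs.max(dim=-1)    # (max probability, argmax token)
    if confidence >= CONFIDENCE_THRESHOLD:
        # Mode 1: low-latency edge standalone inference -- accept the token
        # locally and skip any communication with the cloud.
        return token, "edge"
    # Mode 2: ship the intermediate context to the cloud, which runs the
    # remaining transformer layers (higher accuracy, extra communication).
    return cloud_infer(hidden_states), "cloud"
```

Under this reading, raising the threshold trades edge-side latency savings for more cloud offloading, which is consistent with the adaptive behavior the abstract describes.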