Deploying Large Language Model (LLM) services at the edge benefits latency-sensitive and privacy-aware applications. However, the stateless nature of LLMs makes managing user context (e.g., sessions, preferences) across geo-distributed edge nodes challenging. Existing solutions, such as client-side context storage, introduce network latency and bandwidth overhead, undermining edge deployment advantages. We propose DisCEdge, a distributed context management system that stores and replicates user context in tokenized form across edge nodes. By maintaining context as token sequences, our system avoids redundant computation and enables efficient data replication. We evaluate an open-source prototype in a realistic edge environment. DisCEdge improves median response times by up to 14.46% and lowers median inter-node synchronization overhead by up to 15% compared to a raw-text-based system. It also reduces client request sizes by a median of 90% compared to client-side context management, while guaranteeing data consistency.
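The benefit of keeping context as token sequences can be illustrated with a minimal sketch. It is not the DisCEdge implementation: `ToyTokenizer`, `RawTextContextStore`, `TokenContextStore`, and `compare_tokenization_work` are hypothetical names, and the whitespace tokenizer is a stand-in for a real LLM tokenizer. The sketch also assumes that tokenizing a new message separately and appending it yields the same ids as tokenizing the full history, which holds for this toy tokenizer but not necessarily for subword tokenizers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ToyTokenizer:
    """Hypothetical stand-in for an LLM tokenizer: one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}
        self.words_processed = 0  # counts how much tokenization work was done

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            self.words_processed += 1
            ids.append(self.vocab.setdefault(word, len(self.vocab)))
        return ids


@dataclass
class RawTextContextStore:
    """Baseline: keeps the history as text and re-tokenizes all of it per request."""

    tokenizer: ToyTokenizer
    history: dict[str, str] = field(default_factory=dict)

    def build_prompt(self, session_id: str, message: str) -> list[int]:
        text = (self.history.get(session_id, "") + " " + message).strip()
        self.history[session_id] = text
        return self.tokenizer.encode(text)


@dataclass
class TokenContextStore:
    """Token-based variant: tokenizes only the new message and appends the ids."""

    tokenizer: ToyTokenizer
    history: dict[str, list[int]] = field(default_factory=dict)

    def build_prompt(self, session_id: str, message: str) -> list[int]:
        tokens = self.history.setdefault(session_id, [])
        tokens.extend(self.tokenizer.encode(message))
        return list(tokens)


def compare_tokenization_work(messages: list[str]):
    """Run the same session through both stores; return work counts and final prompts."""
    raw = RawTextContextStore(ToyTokenizer())
    tok = TokenContextStore(ToyTokenizer())
    raw_prompt, tok_prompt = [], []
    for message in messages:
        raw_prompt = raw.build_prompt("session-1", message)
        tok_prompt = tok.build_prompt("session-1", message)
    return (raw.tokenizer.words_processed, tok.tokenizer.words_processed,
            raw_prompt, tok_prompt)
```

In this sketch both stores produce the same prompt tokens, but the raw-text store's tokenization work grows with the full history on every turn, while the token store's work grows only with the new message. The stored integer ids are also the unit that would be replicated between nodes in a token-based design.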