E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and effective resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, an assumption that fails in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation exploits the inherent differences between these two phases of LLM inference. To organize devices effectively, we use a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.
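The abstract does not spell out the Dynamic Programming formulation, so the following is only a minimal sketch of one common way to pose the intra-cluster partitioning step, not the authors' actual method. It assumes the model is a chain of layers with known compute costs, that the devices of a cluster are arranged in a fixed pipeline order, that each device receives one contiguous, non-empty block of layers, and that memory limits and communication costs are ignored. The function name `partition_layers` and its parameters `layer_costs` and `device_speeds` are hypothetical.

```python
from itertools import accumulate


def partition_layers(layer_costs, device_speeds):
    """Assign contiguous layer blocks to devices (in the given order) so that
    the slowest pipeline stage is as fast as possible.

    Returns (bottleneck_time, stages), where stages[k] = (start, end) means
    device k runs layers start..end-1.
    """
    n, d = len(layer_costs), len(device_speeds)
    if d == 0 or n < d:
        raise ValueError("need at least one layer per device")

    # prefix[i] = total cost of the first i layers
    prefix = [0.0, *accumulate(layer_costs)]
    inf = float("inf")

    # best[k][i]: smallest achievable bottleneck when the first i layers
    # are placed on the first k devices; cut[k][i] remembers the split point.
    best = [[inf] * (n + 1) for _ in range(d + 1)]
    cut = [[0] * (n + 1) for _ in range(d + 1)]
    best[0][0] = 0.0

    for k in range(1, d + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # device k-1 runs layers j..i-1
                stage_time = (prefix[i] - prefix[j]) / device_speeds[k - 1]
                candidate = max(best[k - 1][j], stage_time)
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded split points backwards to recover the stages.
    stages, i = [], n
    for k in range(d, 0, -1):
        j = cut[k][i]
        stages.append((j, i))
        i = j
    stages.reverse()
    return best[d][n], stages


if __name__ == "__main__":
    # Six equal-cost layers on a fast device (speed 2) and a slow one (speed 1).
    print(partition_layers([1, 1, 1, 1, 1, 1], [2, 1]))
```

In this toy setting the faster device ends up with four layers and the slower one with two, so both stages take the same time. A fuller treatment along the lines the abstract describes would also need per-device memory constraints and inter-device transfer costs, which this sketch deliberately leaves out.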

