Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, an assumption that does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-constrained settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation exploits the inherent differences between these two phases of LLM inference. To organize devices effectively, we use a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.
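The abstract does not spell out the Dynamic Programming formulation, so the following is only a minimal sketch of one common way to pose the within-cluster partitioning problem, not the authors' method. It assumes the model is a chain of layers with known per-layer compute costs, that the devices of a cluster are already arranged in a fixed pipeline order, that each device receives a contiguous, non-empty slice of layers, and that communication cost is ignored. The names `partition_layers`, `layer_costs`, and `device_speeds` are hypothetical and chosen for illustration.

```python
from itertools import accumulate


def partition_layers(layer_costs, device_speeds):
    """Split contiguous layers across an ordered device list.

    Minimizes the slowest pipeline stage (the bottleneck), where a stage's
    time is the summed cost of its layers divided by its device's speed.
    Returns (bottleneck_time, boundaries); device k runs layers
    boundaries[k] .. boundaries[k + 1] - 1.
    Illustrative assumption: costs and speeds are known in advance.
    """
    n, m = len(layer_costs), len(device_speeds)
    if m == 0 or n < m:
        raise ValueError("need at least one layer per device")

    # prefix[i] is the total cost of the first i layers.
    prefix = [0.0] + list(accumulate(layer_costs))
    inf = float("inf")

    # best[k][i]: smallest bottleneck when the first i layers are placed
    # on the first k devices. cut[k][i]: where device k's slice starts.
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0

    for k in range(1, m + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                stage = (prefix[i] - prefix[j]) / device_speeds[k - 1]
                candidate = max(best[k - 1][j], stage)
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cuts backwards to recover the slice boundaries.
    boundaries = [n]
    i = n
    for k in range(m, 0, -1):
        i = cut[k][i]
        boundaries.append(i)
    boundaries.reverse()
    return best[m][n], boundaries
```

For example, `partition_layers([1, 1, 1, 1, 1, 1], [2, 1])` places four layers on the faster device and two on the slower one, so both stages take the same time. A cost model that includes inter-device transfer time or memory limits would change the stage cost expression but not the overall recurrence.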