Disaggregation has emerged as a powerful strategy for optimizing large language model (LLM) inference by separating the compute-intensive prefill phase and the memory-bound decode phase across specialized GPUs. This separation improves utilization and throughput under fixed hardware capacity. However, as model and cluster scales grow, power, rather than compute, has become the dominant limiter of overall performance and cost efficiency. In this paper, we propose RAPID, a power-aware disaggregated inference framework that jointly manages GPU roles and power budgets to sustain goodput within strict power caps. RAPID combines static and dynamic power reallocation with GPU role reallocation to improve performance under a fixed power bound. RAPID improves overall performance and application consistency beyond what current disaggregation solutions achieve, delivering up to a 2x improvement in SLO attainment at peak load compared to a static assignment, without added complexity or cost.