Over 40% of computational power in Large Language Model (LLM) serving systems can be systematically wasted - not from hardware limits, but from load imbalance in barrier-synchronized parallel processing. When progress is gated by the slowest worker at each step, heterogeneous and evolving workloads create persistent stragglers; faster workers idle while drawing power, producing nothing. In LLM inference alone, this translates to gigawatt-hours of wasted electricity daily. Here we develop a universal load-balancing principle for barrier-synchronized systems with non-migratable state. We prove worst-case theoretical guarantees: the achievable imbalance reduction grows with system scale, and the resulting energy savings can exceed 52% for modern hardware at fleet scale. Experiments corroborate the theory, demonstrating a 28% energy reduction alongside substantial throughput and latency improvements. Formulated as an online integer optimization with provable guarantees, the principle extends beyond LLM serving to broad classes of barrier-synchronized parallel systems, establishing a theoretical foundation for sustainable high-performance computing.
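To make the setting concrete, the sketch below illustrates the core constraint the abstract describes: with non-migratable state, each request is permanently bound to the worker it is assigned to, and every barrier step completes only when the most-loaded worker finishes. This is not the paper's algorithm; it is the classic greedy least-loaded baseline for online makespan minimization, shown purely to clarify the problem structure (function and variable names are illustrative).

```python
# Illustrative sketch, NOT the paper's method: online greedy assignment
# of non-migratable requests to the currently least-loaded worker.
# The makespan (load of the slowest worker) gates every barrier step,
# so minimizing it directly reduces idle time on the faster workers.
import heapq

def assign_online(request_costs, num_workers):
    """Assign each arriving request to the least-loaded worker.

    Because state is non-migratable, each assignment is permanent:
    the decision is made online, one request at a time, with no
    rebalancing afterwards. Returns the per-worker loads and the
    makespan that determines barrier-step completion time.
    """
    # Min-heap of (current load, worker id) for O(log m) selection.
    heap = [(0.0, w) for w in range(num_workers)]
    loads = [0.0] * num_workers
    for cost in request_costs:
        load, w = heapq.heappop(heap)
        loads[w] = load + cost          # irrevocable placement
        heapq.heappush(heap, (loads[w], w))
    return loads, max(loads)

loads, makespan = assign_online([5, 3, 7, 2, 4, 6], num_workers=3)
# Total work is 27 across 3 workers (ideal balance: 9 each), but the
# greedy online placement yields a makespan of 11: the gap between
# makespan and ideal is exactly the idle time the abstract targets.
```

Greedy least-loaded placement is (2 - 1/m)-competitive for online makespan with m workers; the abstract's contribution is a tighter, provably-guaranteed formulation as an online integer optimization, which this baseline does not capture.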