Load balancing (the allocation of work across parallel resources to reduce delay, energy, and cost) is a pervasive challenge in science and engineering, from large-scale simulation and data processing to cloud and manufacturing operations. Motivated by an emerging bottleneck in large language model (LLM) serving, we study a particularly stringent regime of load balancing that arises in barrier-synchronized, stateful systems: work cannot be freely migrated, and progress at each step is gated by the slowest participant, so heterogeneity and temporal drift in workloads create persistent stragglers and substantial idle time. LLM serving under data-parallel decoding is a prominent modern instance: in production traces, barrier-induced idle time can exceed 40% of compute time per decode step. Here we develop a universal load-balancing principle that admits a step-wise finite-horizon integer-optimization formulation and yields worst-case guarantees: across LLM decode models and a broader class of non-decreasing workload-drift processes, it reduces long-run imbalance by a factor that grows with batch size and system scale. Extensive experiments corroborate the theory, showing substantial improvements in throughput and latency together with reductions in energy consumption. These results provide a general, theoretically grounded framework for load balancing, with immediate implications for sustainable LLM serving and broad relevance to other synchronization-gated resource-allocation problems.
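The barrier-synchronized regime described above can be illustrated with a minimal toy simulation (not the paper's algorithm): replicas decode their batches in lockstep, each step costs the slowest (largest) batch, and the gap between that maximum and each replica's batch size accrues as idle time. The names `simulate`, `round_robin`, and `least_loaded`, the uniform request lengths, and the batch-size cost model are all illustrative assumptions for this sketch.

```python
import random


def simulate(num_replicas, num_requests, assign, seed=0):
    """Toy model of barrier-synchronized data-parallel decode.

    Each decode step, every replica advances each of its active requests
    by one token; the barrier makes the step cost the *largest* batch, so
    smaller batches accrue idle time. Returns the idle fraction of compute.
    """
    rng = random.Random(seed)
    # Illustrative assumption: decode lengths drawn uniformly at random.
    lengths = [rng.randint(32, 512) for _ in range(num_requests)]
    loads = [[] for _ in range(num_replicas)]  # remaining tokens per request
    for i, n in enumerate(lengths):
        loads[assign(i, loads)].append(n)      # admission-time placement

    busy = idle = 0
    while any(loads):
        sizes = [len(r) for r in loads]        # per-step work ~ batch size
        step = max(sizes)                      # barrier: wait for the slowest
        busy += sum(sizes)
        idle += sum(step - s for s in sizes)
        # Advance one token; drop requests that just emitted their last token.
        loads = [[n - 1 for n in r if n > 1] for r in loads]
    return idle / (busy + idle)


def round_robin(i, loads):
    """Length-oblivious placement by arrival index."""
    return i % len(loads)


def least_loaded(i, loads):
    """Greedy placement onto the replica with the fewest remaining tokens."""
    return min(range(len(loads)), key=lambda j: sum(loads[j]))
```

Comparing `simulate(8, 256, round_robin)` with `simulate(8, 256, least_loaded)` shows how drift-aware placement typically shrinks the idle fraction: round-robin ignores request lengths, so batch sizes drift apart over time, while the greedy rule keeps the replicas' remaining work (and hence their batch-size trajectories) closer together.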