Modern service systems, including cloud platforms and large language model inference endpoints, must distribute jobs across servers whose processing speeds depend on current workloads. At scale, centralized coordination is costly, while naive distributed policies can perform arbitrarily poorly. We study how to design a simple distributed load balancing policy that achieves globally optimal latency performance in such settings. We model the system as a bipartite queueing network with an arbitrary compatibility graph and servers with concave, workload-dependent service rates. We propose the Greatest Marginal Service Rate (GMSR) policy, which routes each job to the connected server where it has the largest marginal impact on service rate. In a discrete-time stochastic model, we show that as the time discretization is refined (shrinking the time step and job size proportionally), the scaled workload process converges almost surely to a fluid limit governed by a differential inclusion. In the fluid regime, GMSR reaches an $\varepsilon$-suboptimal solution in $\mathcal{O}(\delta + \log(1/\varepsilon))$ time from any $\delta$-suboptimal initial state, implying global convergence to the centrally optimal routing. When the system is overloaded, GMSR maximizes throughput, maximizes the number of stabilized backends among throughput-optimal policies, and minimizes total workload over those stabilized backends. GMSR yields a practical routing rule that requires neither demand-rate knowledge nor centralized coordination. By relying only on local information, service providers can achieve near-optimal latency performance through decentralized decisions, making the policy well suited to large-scale cloud computing, LLM serving, and other distributed service environments where centralized control is costly or infeasible.
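The GMSR routing rule described above can be illustrated with a minimal sketch. It assumes that "marginal impact on service rate" means the derivative of a server's service-rate function evaluated at its current workload. The service-rate family $\mu_j(w) = a_j \log(1 + w)$, the helper `make_marginal_rate`, and the function `gmsr_route` are hypothetical names chosen for illustration and do not come from the paper.

```python
from typing import Callable, Dict, Sequence

# Assumption: each server j has a concave service rate mu_j(w) = a_j * log(1 + w),
# so its marginal service rate a_j / (1 + w) decreases as workload w grows.
def make_marginal_rate(a: float) -> Callable[[float], float]:
    """Return the derivative of the hypothetical rate a * log(1 + w)."""
    return lambda w: a / (1.0 + w)

def gmsr_route(
    workloads: Dict[int, float],
    marginal_rates: Dict[int, Callable[[float], float]],
    compatible: Sequence[int],
) -> int:
    """Pick the compatible server with the greatest marginal service rate.

    Only the current workloads of the job's compatible servers are used;
    no arrival rates or global state are needed. Ties go to the first
    server listed in `compatible`.
    """
    if not compatible:
        raise ValueError("job has no compatible server")
    return max(compatible, key=lambda j: marginal_rates[j](workloads[j]))

# Illustrative state: three servers with different scales a_j and workloads.
workloads = {0: 3.0, 1: 0.0, 2: 1.0}
marginal_rates = {
    0: make_marginal_rate(4.0),  # marginal rate 4.0 / 4.0 = 1.0
    1: make_marginal_rate(0.5),  # marginal rate 0.5 / 1.0 = 0.5
    2: make_marginal_rate(1.5),  # marginal rate 1.5 / 2.0 = 0.75
}

choice_all = gmsr_route(workloads, marginal_rates, [0, 1, 2])
choice_restricted = gmsr_route(workloads, marginal_rates, [1, 2])
print(choice_all, choice_restricted)
```

In this made-up state, a job compatible with all three servers goes to server 0 even though it carries the largest workload, because one more unit of work raises its service rate the most. A shortest-queue rule would instead pick the idle server 1, which is the kind of difference the policy's name points to.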