Balanced allocation: considerations from large scale service environments

We study d-way balanced allocation, which assigns each incoming job to the least loaded of d randomly chosen servers. While prior work has extensively studied the performance of the basic scheme, there has been less published work on adapting this technique to many aspects of large-scale systems. Based on our experience in building and running planet-scale cloud applications, we extend the understanding of d-way balanced allocation along the following dimensions: (i) Bursts: Events such as breaking news can produce bursts of requests that temporarily exceed the servicing capacity of the system. We therefore explore what happens during a burst and how long the system takes to recover from it. (ii) Priorities: Production systems need to handle jobs with a mix of priorities (e.g., user-facing requests may be high priority while other requests may be low priority). We extend d-way balanced allocation to handle multiple priorities. (iii) Noise: Production systems are typically distributed, so d-way balanced allocation must work with stale or incorrect load information. We therefore explore the impact of noisy information and its interactions with bursts and priorities. We explore the above using both extensive simulations and analytical arguments. Specifically, we show (i) through simulations that d-way balanced allocation recovers quickly from bursts and handles priorities and noise gracefully, and (ii) that analysis of the underlying generative models complements our simulations and provides insight into their results.
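
The basic scheme described above can be illustrated with a minimal sketch. This is not the simulator used in this work. It assumes a static setting in which jobs never depart, exact (noise-free) load information, and candidate servers sampled without replacement. The function name `d_way_allocate` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def d_way_allocate(num_servers, num_jobs, d, seed=0):
    """Assign jobs one at a time to the least loaded of d random servers.

    Assumptions (not taken from the paper): jobs never complete, loads are
    observed exactly, and the d candidates are distinct servers.
    Returns the final per-server load as a list.
    """
    if not 1 <= d <= num_servers:
        raise ValueError("d must be between 1 and num_servers")
    rng = random.Random(seed)  # local generator for reproducible runs
    loads = [0] * num_servers
    for _ in range(num_jobs):
        # Pick d distinct candidate servers uniformly at random.
        candidates = rng.sample(range(num_servers), d)
        # Send the job to the candidate with the smallest current load
        # (ties go to the first such candidate in the sample).
        target = min(candidates, key=loads.__getitem__)
        loads[target] += 1
    return loads


if __name__ == "__main__":
    for d in (1, 2, 3):
        loads = d_way_allocate(num_servers=1000, num_jobs=1000, d=d)
        print(f"d={d}: max load = {max(loads)}")
```

Running the script prints the maximum server load for d = 1, 2, and 3, which gives a rough feel for how adding choices affects imbalance. Modeling bursts, priorities, or stale load information would require adding job departures and a delayed or perturbed view of `loads`, which this sketch deliberately omits.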
