Controlling Data Access Load in Distributed Systems

Distributed systems store data objects redundantly to balance the data access load over multiple nodes. Load balancing performance depends mainly on 1) the level of storage redundancy and 2) the assignment of data objects to storage nodes. We analyze the performance implications of these design choices by considering four practical storage schemes that we refer to as the clustering, cyclic, block, and random designs. We formulate the load-balancing problem as keeping the load on every node below a given threshold. Regarding the level of redundancy, we find that the desired load balance can be achieved in a system of $n$ nodes only if the replication factor $d = \Omega(\log(n)^{1/3})$, which is a necessary condition for any storage design. For clustering and cyclic designs, $d = \Omega(\log(n))$ is necessary and sufficient. For block and random designs, $d = \Omega(\log(n))$ is sufficient but not necessary. Whether $d = \Omega(\log(n)^{1/3})$ is sufficient remains open. The assignment of objects to nodes essentially determines which objects share the access capacity on each node. We refer to the number of nodes jointly shared by a set of objects as the \emph{overlap} between those objects. We find that many consistently slight overlaps between the objects (block, random) are better than few but occasionally significant overlaps (clustering, cyclic). However, when the demand is ``skewed beyond a level'', the impact of overlaps is reversed. We derive our results by connecting the load-balancing problem to mathematical constructs that have been used to study other problems. For a class of storage designs containing the clustering and cyclic designs, we express load balance in terms of the maximum of moving sums of i.i.d. random variables, which is known as the scan statistic. For the random design, we express load balance by using the occupancy metric for random allocation with complexes.
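
To make the scan statistic concrete, the following minimal Python sketch computes the maximum of moving sums over a sequence of values. It is an illustration only and not the analysis used in this work: the function name `scan_statistic`, the `window` and `circular` parameters, and the exponentially distributed demands in the usage lines are all assumptions introduced here.

```python
import random


def scan_statistic(values, window, circular=True):
    """Return the maximum sum over all runs of `window` consecutive values.

    With circular=True the sequence is treated as a ring, so a window
    may wrap around from the last element to the first.
    """
    values = list(values)
    n = len(values)
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and len(values)")
    # Append the first window-1 elements so wrapped windows are covered.
    extended = values + values[:window - 1] if circular else values
    current = sum(extended[:window])
    best = current
    # Slide the window one position at a time, updating the sum in O(1).
    for i in range(window, len(extended)):
        current += extended[i] - extended[i - window]
        best = max(best, current)
    return best


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical i.i.d. demands; the distribution is an assumption.
    demands = [random.expovariate(1.0) for _ in range(1000)]
    for window in (1, 4, 16):
        # Normalizing by the window length gives a per-position average.
        print(window, scan_statistic(demands, window) / window)
```

Under these assumptions, dividing the result by the window length shows how the largest windowed average changes as the window grows, which is the kind of quantity a moving-sum analysis tracks.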
