We consider the problem of stragglers in distributed computing systems. Stragglers are compute nodes that slow down unpredictably, and they often increase the completion times of tasks. One common approach to mitigating stragglers is work replication, in which only the first replica of a task to complete is accepted and the others are discarded. However, discarding work wastes resources. In this paper, we propose a method for exploiting the work completed by stragglers rather than discarding it. The idea is to increase the granularity of the assigned work and to increase the frequency of worker updates. Through experiments on a simulated cluster and on Amazon EC2 with Apache Hadoop, we show that the proposed method reduces the completion time of tasks.
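To make the contrast concrete, the following is a minimal toy sketch, not the scheme evaluated in the paper. It assumes deterministic worker speeds and a shared queue from which workers pull equal-size chunks, reporting after each one; the function names `replication_time` and `chunked_time`, the speeds, and the chunk counts are all hypothetical and chosen only for illustration.

```python
import heapq


def replication_time(work, speeds):
    """Every worker runs the whole job; only the fastest copy is kept.

    The slower workers' partial progress is discarded.
    """
    return work / max(speeds)


def chunked_time(work, speeds, n_chunks):
    """Toy model (an assumption): workers pull equal-size chunks from a
    shared queue and report each finished chunk, so a straggler's
    completed chunks still count toward the job.
    """
    chunk = work / n_chunks
    # Min-heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_chunks):
        t, i = heapq.heappop(free_at)      # next worker to become free
        done = t + chunk / speeds[i]       # when it finishes this chunk
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


if __name__ == "__main__":
    speeds = [1.0, 0.25]  # hypothetical: worker 1 is a 4x straggler
    print(replication_time(10, speeds))   # 10.0
    print(chunked_time(10, speeds, 1))    # 10.0 (one coarse task)
    print(chunked_time(10, speeds, 10))   # 8.0  (ten small tasks)
```

In this toy setting, the straggler contributes two of the ten chunks, so the job finishes at time 8 instead of 10. The benefit depends on the chunk count: a chunk handed to the straggler late in the job can still delay completion, which is one reason frequent worker updates matter.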