Consider multiple random walks (RWs) on a graph that together execute a computational task. For instance, in decentralized learning via RWs, a model is updated at each iteration using the local data of the visited node and is then passed to a randomly chosen neighbor. RWs can fail due to node or link failures. The goal is to maintain a desired number of RWs to ensure resilience to failures. This is challenging because no central entity tracks which RWs have failed so that they can be replaced by forking (duplicating) surviving ones. Without such duplication, the number of RWs eventually drops to zero, causing a catastrophic failure of the system. We propose a decentralized algorithm called DECAFORK that keeps the number of RWs in the graph around a desired value even in the presence of arbitrary RW failures. Nodes continuously estimate the number of surviving RWs by estimating their return time distribution, and fork RWs when failures are likely to happen. We present extensive numerical simulations that evaluate how quickly DECAFORK detects and reacts to failures. We also provide theoretical guarantees on the performance of the algorithm.
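To make the failure and forking mechanism concrete, the following is a minimal toy simulation, not DECAFORK itself. It assumes a complete graph with self-loops, independent per-step failures, and a deliberately crude local rule: a node counts the distinct RWs it has seen within a fixed window and forks the visiting RW with some probability when that count falls below the target. The function `simulate` and all of its parameters (`fail_prob`, `fork_prob`, `window`, and so on) are hypothetical names chosen for illustration; the return-time-distribution estimator described above is replaced here by the simpler windowed count.

```python
import random


def simulate(n_nodes=20, target=5, steps=2000, fail_prob=0.002,
             fork=True, fork_prob=0.2, window=100, seed=0):
    """Toy model (an assumption, not the paper's algorithm).

    Returns the number of surviving walks after each time step.
    """
    rng = random.Random(seed)
    # Walk id -> current node. Start with `target` walks at random nodes.
    walks = {wid: rng.randrange(n_nodes) for wid in range(target)}
    next_id = target
    # Each node only keeps local state: when it last saw each walk id.
    last_seen = [{} for _ in range(n_nodes)]
    history = []

    for t in range(steps):
        # 1) Each surviving walk fails independently with prob. fail_prob.
        for wid in list(walks):
            if rng.random() < fail_prob:
                del walks[wid]

        # 2) Each surviving walk hops to a uniformly random node
        #    (complete graph with self-loops, assumed for simplicity).
        for wid in list(walks):
            node = rng.randrange(n_nodes)
            walks[wid] = node
            seen = last_seen[node]
            seen[wid] = t
            # Forget walks that have not returned within the window.
            for old in [w for w, s in seen.items() if t - s > window]:
                del seen[old]
            # 3) Crude local estimate: distinct walks seen recently.
            estimate = len(seen)
            if (fork and t >= window and estimate < target
                    and rng.random() < fork_prob):
                walks[next_id] = node   # forked copy starts at this node
                seen[next_id] = t
                next_id += 1

        history.append(len(walks))
    return history


if __name__ == "__main__":
    no_fork = simulate(fork=False)
    with_fork = simulate(fork=True)
    print("final count without forking:", no_fork[-1])
    print("final count with forking:   ", with_fork[-1])
```

In this sketch, a run with `fork=False` can only lose walks, while a run with `fork=True` can replace them as long as at least one walk survives. The exact counts depend on the seed and on the assumed parameters, and the windowed count reacts to a failure only after the stale entry ages out, which is the kind of delay a return-time-based estimator is meant to handle more carefully.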