Safra's distributed termination detection algorithm employs a logical token ring structure within a distributed network; only passive nodes forward the token, and a counter in the token keeps track of the number of sent minus the number of received messages. We adapt this classic algorithm to make it fault-tolerant. The counter is split into counters per node, to discard counts from crashed nodes. If a node crashes, the token ring is restored locally and a backup token is sent. Nodes inform each other of detected crashes via the token. Our algorithm imposes no additional message overhead, tolerates any number of crashes as well as simultaneous crashes, and copes with crashes in a decentralized fashion. Correctness proofs are provided of both the original Safra's algorithm and its fault-tolerant variant, as well as a model checking analysis.
翻译:Safra分布式终止检测算法在分布式网络中采用逻辑令牌环结构:仅被动节点转发令牌,令牌中的计数器记录已发送消息数与已接收消息数之差。本文对这一经典算法进行改进以实现容错性。通过将计数器拆分为各节点独立计数器,可丢弃故障节点的计数。当节点发生故障时,令牌环将在本地恢复并发送备份令牌。节点通过令牌相互通知检测到的故障。本算法不增加额外消息开销,可容忍任意数量及并发故障,并以去中心化方式处理故障。本文提供了原始Safra算法及其容错变体的正确性证明,并进行了模型检验分析。