Robust Synchronisation for Federated Learning in the Face of Correlated Device Failure

Probabilistic Synchronous Parallel (PSP) is a technique for distributed learning systems that reduces synchronization bottlenecks by sampling a subset of participating nodes in each round. In Federated Learning (FL), where edge devices are often unreliable because of mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes that device behavior is static and that devices behave independently of one another. This can make synchronization unfair: highly available nodes dominate training, while nodes that are often unavailable rarely participate, so their data may be missed. When a device's data distribution and its availability are correlated, both PSP and standard FL algorithms persistently under-represent certain classes or groups, so the features associated with them are learned inefficiently or not at all. We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses this correlation between unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient from chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata, including latency, freshness, and utility scores. We implement AW-PSP, and a trace-driven evaluation shows that, compared to standard PSP, it improves robustness to both independent and correlated failures, increases label coverage, and reduces fairness variance. AW-PSP thus provides an availability-aware and fairness-conscious node sampling protocol for FL deployments that scales to large numbers of nodes, even in heterogeneous and failure-prone environments.
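
To make the idea of availability-weighted sampling concrete, the following is a minimal sketch and not the AW-PSP implementation. It assumes a two-state (up/down) Markov chain per node and a simple inverse-participation fairness factor. The names `NodeStats`, `predict_available`, `is_chronic`, `sampling_weights`, and `sample_nodes`, as well as the 0.1 chronic-failure threshold, are hypothetical. Failure correlation metrics and the DHT metadata layer are deliberately left out.

```python
import random
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node record; the field names are illustrative only."""
    p_up_to_down: float    # estimated P(available -> unavailable) per round
    p_down_to_up: float    # estimated P(unavailable -> available) per round
    available_now: bool    # last observed state
    rounds_selected: int = 0  # how often this node has already participated


def predict_available(s: NodeStats) -> float:
    """One-step-ahead availability under a two-state Markov chain."""
    return 1.0 - s.p_up_to_down if s.available_now else s.p_down_to_up


def is_chronic(s: NodeStats, threshold: float = 0.1) -> bool:
    """Assumed rule: a down node that rarely recovers is a chronic failure."""
    return (not s.available_now) and s.p_down_to_up < threshold


def sampling_weights(nodes: dict[str, NodeStats]) -> dict[str, float]:
    """Weight = predicted availability x boost for rarely selected nodes."""
    weights = {}
    for name, s in nodes.items():
        fairness_boost = 1.0 / (1.0 + s.rounds_selected)
        weights[name] = predict_available(s) * fairness_boost
    return weights


def sample_nodes(nodes: dict[str, NodeStats], k: int,
                 rng: random.Random) -> list[str]:
    """Weighted sampling without replacement of up to k nodes."""
    pool = {n: w for n, w in sampling_weights(nodes).items() if w > 0}
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # remove so the same node is not picked twice
    return chosen


nodes = {
    "a": NodeStats(0.1, 0.5, available_now=True),
    "b": NodeStats(0.1, 0.05, available_now=False),
    "c": NodeStats(0.5, 0.5, available_now=True, rounds_selected=9),
}
selected = sample_nodes(nodes, k=2, rng=random.Random(0))
```

In this sketch, node `b` is currently down with a low recovery probability, so it is flagged as chronic and receives a small weight. Node `c` is available but has already been selected often, so its weight is reduced in favour of less-represented nodes.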
