Distributed scientific workflows increasingly span heterogeneous compute clusters, edge resources, and geo-distributed data repositories. In these environments, a centralized orchestrator is an architectural bottleneck: it introduces a single point of failure, limits scalability, and constrains adaptability to changing resource availability and failures. Decentralized multi-agent coordination offers a compelling alternative: autonomous agents representing distributed resources collaboratively negotiate workload assignment (e.g., job selection) through peer-to-peer consensus, making decisions based on local compute capacity, data locality, and network conditions. However, scaling such systems to production workloads requires addressing challenges in coordination, resilience, and data-aware optimization. This work presents SWARM+, which builds on our prior work demonstrating the feasibility of decentralized multi-agent consensus for distributed job selection. SWARM+ addresses three main problems: scalability of consensus across large numbers of agents, resilience of workload management under agent failure, and efficiency of job scheduling for highly distributed resources and data-intensive workloads. For each problem, we propose novel algorithms and evaluate them on the distributed FABRIC testbed. The results show that SWARM+ (a) scales to 1000 distributed agents, with nearly equal workload distribution across hierarchy levels and reduced coordination overhead due to hierarchical consensus; (b) is resilient to agent failures, maintaining a >99% job completion rate under a single agent failure and degrading gracefully, with at most 7.5% impact when 50% of agents fail; and (c) improves on baseline SWARM by 97-98% in both selection time and scheduling latency.