Serverless computing has driven notable advances in distributed machine learning, particularly in parameter server-based architectures. Yet the integration of serverless features into peer-to-peer (P2P) distributed networks remains largely unexplored. In this paper, we introduce SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture designed to bridge this gap. Building on the inherent robustness and reliability of P2P systems, SPIRT employs RedisAI for in-database operations, leading to an 82\% reduction in the time required for model updates and gradient averaging across a variety of models and batch sizes. The architecture is resilient to peer failures and smoothly integrates new peers, demonstrating its fault tolerance and scalability. Furthermore, SPIRT secures communication between peers, enhancing the reliability of distributed machine learning tasks. Even under Byzantine attacks, the system's robust aggregation algorithms maintain high accuracy. These findings highlight the potential of serverless architectures for P2P distributed machine learning and mark a significant step toward more efficient, scalable, and resilient applications.
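To make the contrast between plain gradient averaging and robust aggregation concrete, the following is a minimal sketch in plain Python. It is not SPIRT's implementation: the abstract does not name the aggregation rules used, so the coordinate-wise median shown here is an assumed stand-in for "robust aggregation", and the function names and toy gradient values are hypothetical.

```python
from statistics import median
from typing import List, Sequence

Gradient = Sequence[float]


def average_gradients(peer_gradients: List[Gradient]) -> List[float]:
    """Plain coordinate-wise mean of the gradients received from all peers."""
    n = len(peer_gradients)
    return [sum(column) / n for column in zip(*peer_gradients)]


def median_gradients(peer_gradients: List[Gradient]) -> List[float]:
    """Coordinate-wise median, one common Byzantine-robust rule (assumed here)."""
    return [median(column) for column in zip(*peer_gradients)]


# Toy data (hypothetical): three honest peers and one Byzantine peer
# that sends an arbitrarily large gradient.
honest = [[0.10, -0.20], [0.12, -0.18], [0.08, -0.22]]
byzantine = [[100.0, 100.0]]
received = honest + byzantine

mean_update = average_gradients(received)    # dominated by the outlier
robust_update = median_gradients(received)   # stays close to the honest values

print("mean:  ", mean_update)
print("median:", robust_update)
```

In this toy setting a single malicious peer pulls the mean far from the honest gradients, while the median stays within the honest range, which is the general intuition behind robust aggregation under Byzantine attacks.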