Distributed machine learning is the practice of training a model across multiple computers or devices, called nodes. By distributing the workload, it can speed up training and allow more complex models to be trained. Serverless computing is a newer cloud computing paradigm that uses functions as the unit of computation. It can benefit distributed learning systems through automated resource scaling, reduced manual intervention, and lower cost. Several topologies for distributed machine learning have been established: centralized, parameter server, and peer-to-peer (P2P). The parameter server architecture, however, may be limited in fault tolerance: it has a single point of failure and complex recovery processes. Training machine learning models in a P2P architecture can improve fault tolerance by eliminating that single point of failure. In a P2P architecture, each node (or worker) can act as both a server and a client, which allows more decentralized decision making and removes the need for a central coordinator. In this position paper, we propose exploring the use of serverless computing for distributed machine learning training and comparing the performance of the P2P architecture with that of the parameter server architecture, focusing on cost reduction and fault tolerance.
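The contrast between the two topologies described in the abstract can be illustrated with a minimal, in-process sketch. It is not the authors' system: the function names (`parameter_server_round`, `p2p_round`), the plain gradient-averaging update, and the all-to-all exchange among peers are assumptions made purely for illustration, and no serverless platform or network is involved. A `None` gradient is a hypothetical stand-in for a failed node.

```python
from typing import List, Optional

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sgd_step(params: Vector, grad: Vector, lr: float) -> Vector:
    """Plain gradient-descent update: params - lr * grad."""
    return [p - lr * g for p, g in zip(params, grad)]


def parameter_server_round(
    params: Vector,
    worker_grads: List[Vector],
    lr: float,
    server_alive: bool = True,
) -> Optional[Vector]:
    """Workers push gradients to one central server, which updates the model.

    Returns None when the server is down: no round can complete, which is
    the single point of failure mentioned in the abstract.
    """
    if not server_alive:
        return None
    return sgd_step(params, average(worker_grads), lr)


def p2p_round(
    peer_params: List[Vector],
    peer_grads: List[Optional[Vector]],
    lr: float,
) -> List[Vector]:
    """Every live peer averages the gradients of all live peers (assumed
    all-to-all exchange) and updates its own copy of the model.

    A None entry in peer_grads marks a failed peer; the remaining peers
    still complete the round without any central coordinator.
    """
    live_grads = [g for g in peer_grads if g is not None]
    mean_grad = average(live_grads)
    return [
        sgd_step(params, mean_grad, lr)
        for params, grad in zip(peer_params, peer_grads)
        if grad is not None
    ]


if __name__ == "__main__":
    start = [1.0, 2.0]
    grads = [[0.5, 1.0], [1.5, 3.0]]

    # Parameter server: works while the server is up, stalls when it is down.
    print(parameter_server_round(start, grads, lr=0.5))
    print(parameter_server_round(start, grads, lr=0.5, server_alive=False))

    # P2P: the middle peer has failed, the other two still make progress.
    print(p2p_round([start, start, start], [grads[0], None, grads[1]], lr=0.5))
```

In this toy setting both topologies compute the same update when every node is healthy; the difference shows up only under failure, where the parameter server round returns nothing while the surviving peers still finish their round. A real comparison along the lines the paper proposes would also have to account for communication cost, which this sketch ignores.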