Collaborative Machine Learning is a paradigm in the field of distributed machine learning, designed to address the challenges of data privacy, communication overhead, and model heterogeneity. There have been significant advancements in optimization and communication algorithm design and ML hardware that enables fair, efficient and secure collaborative ML training. However, less emphasis is put on collaborative ML infrastructure development. Developers and researchers often build server-client systems for a specific collaborative ML use case, which is not scalable and reusable. As the scale of collaborative ML grows, the need for a scalable, efficient, and ideally multi-tenant resource management system becomes more pressing. We propose a novel system, Propius, that can adapt to the heterogeneity of client machines, and efficiently manage and control the computation flow between ML jobs and edge resources in a scalable fashion. Propius is comprised of a control plane and a data plane. The control plane enables efficient resource sharing among multiple collaborative ML jobs and supports various resource sharing policies, while the data plane improves the scalability of collaborative ML model sharing and result collection. Evaluations show that Propius outperforms existing resource management techniques and frameworks in terms of resource utilization (up to $1.88\times$), throughput (up to $2.76$), and job completion time (up to $1.26\times$).
翻译:协同机器学习是分布式机器学习领域的一种范式,旨在应对数据隐私、通信开销和模型异质性等挑战。在优化与通信算法设计以及支持公平、高效、安全协同ML训练的硬件方面已取得显著进展。然而,协同ML基础设施的开发尚未得到足够重视。开发者和研究人员常为特定协同ML用例构建服务器-客户端系统,这类系统缺乏可扩展性和可复用性。随着协同ML规模的扩大,对可扩展、高效且理想情况下支持多租户的资源管理系统的需求日益迫切。我们提出了一种新颖的系统Propius,该系统能够适应客户端机器的异质性,并以可扩展的方式高效管理和控制ML任务与边缘资源间的计算流。Propius由控制平面和数据平面构成。控制平面支持多个协同ML任务间的高效资源共享,并兼容多种资源共享策略;数据平面则提升了协同ML模型共享与结果收集的可扩展性。评估结果表明,Propius在资源利用率(最高达$1.88\times$)、吞吐量(最高达$2.76$)和任务完成时间(最高达$1.26\times$)方面均优于现有资源管理技术与框架。