Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques preserve privacy because each device shares with the server a model trained on its local data rather than the raw data itself. However, CML training is inefficient due to low resource utilization. We identify idle resources on the server and devices, caused by sequential computation and communication, as the principal reason for low utilization. PiPar, a novel framework that leverages pipeline parallelism for CML techniques, is developed to substantially improve resource utilization. A new training pipeline is designed to parallelize computation across different hardware resources and communication across different bandwidth resources, thereby accelerating the training process in CML. A low-overhead automated parameter selection method is proposed to optimize the pipeline and maximize the utilization of available resources. The experimental results confirm the validity of the underlying approach of PiPar and highlight that, when compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1x, and (ii) the overall training time can be accelerated by up to 34.6x under varying network conditions for a collection of six small and large popular deep neural networks and four datasets, without sacrificing accuracy. It is also experimentally demonstrated that PiPar retains its performance benefits when differential privacy methods are incorporated and when operating in environments with heterogeneous devices and changing bandwidths.
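The speedup claimed above comes from overlapping computation and communication rather than running them back to back. The following is a minimal back-of-envelope sketch, not PiPar's actual scheduler: it compares the wall-clock length of one training round when device computation, uplink communication, and server computation proceed sequentially per micro-batch versus when they overlap in an assembly-line pipeline. The function names and stage timings are illustrative assumptions.

```python
# Illustrative model only (not PiPar's implementation): makespan of a round
# with three stages per micro-batch: device compute, communication, server
# compute.

def sequential_makespan(n_batches, t_device, t_comm, t_server):
    # Sequential execution: each micro-batch completes all three stages
    # before the next one starts, so resources idle while others work.
    return n_batches * (t_device + t_comm + t_server)

def pipelined_makespan(n_batches, t_device, t_comm, t_server):
    # Pipelined execution: stages overlap across micro-batches. After the
    # pipeline fills, throughput is limited by the slowest stage.
    stages = (t_device, t_comm, t_server)
    return sum(stages) + (n_batches - 1) * max(stages)

# Hypothetical stage times (seconds) for 8 micro-batches.
seq = sequential_makespan(8, 2.0, 3.0, 1.0)   # 8 * (2 + 3 + 1) = 48.0
pipe = pipelined_makespan(8, 2.0, 3.0, 1.0)   # (2 + 3 + 1) + 7 * 3 = 27.0
print(seq, pipe)
```

Even in this toy model the pipelined schedule hides most of the communication time behind computation; the gap widens as the number of micro-batches grows, which is consistent with the idle-time reductions reported above.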