Federated learning is a distributed machine learning approach in which a server aggregates weight parameters trained locally by clients into global parameters. The global parameters can thus be trained without uploading the clients' privacy-sensitive raw data to the server. Because the server aggregates simply by averaging the local weight parameters, aggregation is an I/O-intensive task in which network processing accounts for a large portion of the work relative to computation. The network processing workload grows further as the number of clients increases. To mitigate this workload, this paper offloads the federated learning server to an NVIDIA BlueField-2 DPU, a smart NIC (Network Interface Card) with eight processing cores. Dedicated processing cores are assigned via DPDK (Data Plane Development Kit) to receiving the local weight parameters and sending the global parameters. The aggregation task is parallelized across the multiple cores available on the DPU. To further improve performance, an approximated design that eliminates exclusive access control between the computation threads is also implemented. Evaluation results show that the federated learning server on the DPU runs 1.32 times faster than the one on the host CPU, with negligible accuracy loss.
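The averaging step described above can be illustrated with a minimal sketch. It assumes each client sends a flat list of floats of equal length; the names `average_slice`, `federated_average`, and `n_workers` are hypothetical and do not come from the paper. The sketch avoids locking by giving each worker a disjoint slice of the parameter vector, which is a different technique from the approximated design mentioned in the abstract, whose details are not given here.

```python
from concurrent.futures import ThreadPoolExecutor


def average_slice(client_weights, start, stop):
    """Average elements [start, stop) across all clients' weight vectors."""
    n_clients = len(client_weights)
    return [
        sum(w[i] for w in client_weights) / n_clients
        for i in range(start, stop)
    ]


def federated_average(client_weights, n_workers=4):
    """Element-wise mean of equally sized weight vectors, split across workers."""
    if not client_weights:
        raise ValueError("need at least one client")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all weight vectors must have the same length")
    # Disjoint, contiguous slices: no two workers touch the same element,
    # so no lock is needed (an assumption of this sketch, not the paper's design).
    chunk = max(1, -(-dim // n_workers))  # ceiling division
    bounds = [(s, min(s + chunk, dim)) for s in range(0, dim, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda b: average_slice(client_weights, *b), bounds)
        return [x for part in parts for x in part]
```

For example, `federated_average([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]], n_workers=2)` averages two clients' vectors element by element. In CPython, threads are unlikely to speed up this pure-Python arithmetic because of the global interpreter lock, so the sketch only illustrates how the work is partitioned, not the performance of a DPDK implementation on DPU cores.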