Speed Up Federated Learning in Heterogeneous Environment: A Dynamic Tiering Approach

Federated learning (FL) enables collaborative training of a model while keeping the training data decentralized and private. However, one significant impediment to training a model with FL, especially a large model, is that client devices are resource-constrained and differ in computation capacity, communication capacity, and task size. This heterogeneity causes large variations in client training time, which lengthens the overall training time and wastes resources on faster clients that must wait for slower ones. To tackle these heterogeneity issues, we propose the Dynamic Tiering-based Federated Learning (DTFL) system, in which slower clients dynamically offload part of the model to the server to alleviate resource constraints and speed up training. By leveraging the concept of Split Learning, DTFL offloads different portions of the global model to clients in different tiers and enables each client to update the models in parallel via local-loss-based training. This reduces the computation and communication demand on resource-constrained devices and thus mitigates the straggler problem. DTFL introduces a dynamic tier scheduler that uses tier profiling to estimate the expected training time of each client, based on its historical training time, communication speed, and dataset size. The scheduler then assigns clients to suitable tiers to minimize the overall training time in each round. We first theoretically prove the convergence properties of DTFL. We then train large models (ResNet-56 and ResNet-110) on popular image datasets (CIFAR-10, CIFAR-100, CINIC-10, and HAM10000) under both IID and non-IID settings. Extensive experimental results show that, compared with state-of-the-art FL methods, DTFL can significantly reduce the training time while maintaining model accuracy.
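The tier-scheduling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cost model (client compute plus transfer time), the `Tier` and `ClientProfile` fields, the per-client greedy rule, and all numbers below are assumptions chosen for illustration, and server-side compute time is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    # Hypothetical tier description (not from the paper).
    client_compute_fraction: float  # share of the full model's compute run on the client
    upload_bytes_per_sample: float  # intermediate data sent to the server per sample
    client_model_bytes: float       # size of the client-side sub-model exchanged per round


@dataclass(frozen=True)
class ClientProfile:
    # Hypothetical profile built from the three signals named in the abstract.
    full_model_secs_per_sample: float  # derived from historical training time
    bytes_per_sec: float               # measured communication speed
    num_samples: int                   # local dataset size


def estimate_time(client: ClientProfile, tier: Tier) -> float:
    """Assumed cost model: client-side compute time plus transfer time."""
    compute = (client.num_samples
               * client.full_model_secs_per_sample
               * tier.client_compute_fraction)
    traffic = tier.client_model_bytes + client.num_samples * tier.upload_bytes_per_sample
    return compute + traffic / client.bytes_per_sec


def schedule(clients: dict, tiers: list):
    """Give each client the tier with the lowest estimated time.

    Under the assumption that clients do not compete for server resources,
    this also minimizes the slowest client's time, i.e. the round time.
    """
    assignment = {
        name: min(range(len(tiers)), key=lambda i: estimate_time(c, tiers[i]))
        for name, c in clients.items()
    }
    round_time = max(estimate_time(clients[n], tiers[i]) for n, i in assignment.items())
    return assignment, round_time


# Illustrative numbers only.
tiers = [
    Tier(client_compute_fraction=0.25, upload_bytes_per_sample=4000.0, client_model_bytes=1e5),
    Tier(client_compute_fraction=1.0, upload_bytes_per_sample=0.0, client_model_bytes=1e6),
]
clients = {
    "slow_cpu": ClientProfile(full_model_secs_per_sample=0.01, bytes_per_sec=1e6, num_samples=1000),
    "slow_link": ClientProfile(full_model_secs_per_sample=0.001, bytes_per_sec=1e5, num_samples=1000),
}
assignment, round_time = schedule(clients, tiers)
print(assignment, round_time)
```

In this toy setup the compute-bound client is placed in the tier that offloads most of the model, while the client with a slow link keeps more of the model locally to avoid sending intermediate data. A real scheduler would also have to account for server load and refresh the profiles every round.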

