Federated learning (FL) enables collaborative training of a model while keeping the training data decentralized and private. However, a significant impediment to training a model with FL, especially a large one, is that client devices are resource-constrained, with heterogeneous computation and communication capacities and varying task sizes. This heterogeneity causes large variations in client training times, which lengthens the overall training time and wastes resources on faster clients. To tackle these heterogeneity issues, we propose the Dynamic Tiering-based Federated Learning (DTFL) system, in which slower clients dynamically offload part of the model to the server to alleviate resource constraints and speed up training. By leveraging the concept of Split Learning, DTFL offloads different portions of the global model to clients in different tiers and enables each client to update its model in parallel via local-loss-based training. This reduces the computation and communication demand on resource-constrained devices and thus mitigates the straggler problem. DTFL introduces a dynamic tier scheduler that uses tier profiling to estimate the expected training time of each client from its historical training time, communication speed, and dataset size. The scheduler then assigns clients to suitable tiers to minimize the overall training time in each round. We first theoretically prove the convergence properties of DTFL. We then train large models (ResNet-56 and ResNet-110) on popular image datasets (CIFAR-10, CIFAR-100, CINIC-10, and HAM10000) under both IID and non-IID settings. Extensive experimental results show that, compared with state-of-the-art FL methods, DTFL can significantly reduce the training time while maintaining model accuracy.
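To make the scheduling idea concrete, the following is a minimal sketch of tier assignment and not the DTFL algorithm itself. It assumes a hypothetical cost model in which each tier keeps a fixed fraction of the model's compute on the client and exchanges a fixed number of bytes per sample with the server. The names `ClientProfile`, `TIERS`, `estimate_time`, and `assign_tiers`, and all numeric values, are illustrative assumptions. The sketch also ignores server-side load, so each client can simply pick its own fastest tier.

```python
from dataclasses import dataclass


@dataclass
class ClientProfile:
    """Hypothetical per-client measurements a scheduler might track."""
    seconds_per_sample: float  # compute time per sample for the full model, from past rounds
    bytes_per_second: float    # observed communication speed
    num_samples: int           # local dataset size


# Hypothetical tier table (values are made up for illustration):
# tier -> (fraction of model compute kept on the client,
#          bytes exchanged with the server per sample)
TIERS = {
    1: (0.25, 4000.0),  # most of the model offloaded to the server
    2: (0.50, 2000.0),
    3: (1.00, 0.0),     # nothing offloaded, no per-sample traffic
}


def estimate_time(profile: ClientProfile, tier: int) -> float:
    """Estimate one round's training time for a client in a given tier."""
    client_fraction, bytes_per_sample = TIERS[tier]
    compute = client_fraction * profile.seconds_per_sample * profile.num_samples
    comm = bytes_per_sample * profile.num_samples / profile.bytes_per_second
    return compute + comm


def assign_tiers(profiles: dict[str, ClientProfile]) -> tuple[dict[str, int], float]:
    """Give each client its fastest tier; the round lasts as long as the slowest client."""
    assignment = {}
    for client_id, profile in profiles.items():
        assignment[client_id] = min(TIERS, key=lambda t: estimate_time(profile, t))
    round_time = max(
        (estimate_time(profiles[c], t) for c, t in assignment.items()),
        default=0.0,
    )
    return assignment, round_time


profiles = {
    "slow_device": ClientProfile(seconds_per_sample=0.02, bytes_per_second=1e6, num_samples=1000),
    "fast_device": ClientProfile(seconds_per_sample=0.002, bytes_per_second=1e6, num_samples=1000),
}
assignment, round_time = assign_tiers(profiles)
print(assignment, round_time)
```

In this toy setup the slow device ends up in the tier that offloads the most, while the fast device keeps the whole model locally. A scheduler that also accounts for shared server capacity would need to optimize the tiers jointly rather than per client.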