Federated learning (FL) has emerged as a promising approach to collaboratively train machine learning models across multiple edge devices while preserving privacy. The success of FL hinges on the efficiency of the participating models and their ability to handle the unique challenges of distributed learning. While several variants of the Vision Transformer (ViT) have shown great potential as alternatives to modern convolutional neural networks (CNNs) for centralized training, their unprecedented size and higher computational demands hinder deployment on resource-constrained edge devices, which limits their widespread application in FL. Since client devices in FL typically have limited computing resources and communication bandwidth, models intended for such devices must strike a balance among model size, computational efficiency, and the ability to adapt to the diverse, non-IID data distributions encountered in FL. To address these challenges, we propose OnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks with limited training data and resources. Our models incorporate image-specific inductive biases through the LCT tokenizer, which uses efficient depthwise separable convolutions in residual linear bottleneck blocks to extract local features, while the multi-head self-attention (MHSA) mechanism in the LCT encoder implicitly helps capture global representations of images. Extensive experiments on benchmark image datasets indicate that our models outperform existing lightweight vision models while having fewer parameters and lower computational demands, making them suitable for FL scenarios with data heterogeneity and communication bottlenecks.
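To make the efficiency argument for depthwise separable convolutions concrete, the following is a minimal sketch that counts convolution weights only. It is not the OnDev-LCT implementation: the function names, the channel sizes, the 3 x 3 kernel, and the expansion factor are illustrative assumptions, and the bottleneck layout (1 x 1 expansion, depthwise convolution, 1 x 1 linear projection) follows the common MobileNetV2-style design rather than anything specified in the abstract. Biases and normalization layers are omitted.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def linear_bottleneck_params(c_in, c_out, expansion=2, k=3):
    """Weights in an assumed MobileNetV2-style linear bottleneck block.

    Layout (an assumption, not taken from the abstract):
    1 x 1 expansion -> depthwise k x k -> 1 x 1 linear projection.
    """
    hidden = c_in * expansion
    expand = c_in * hidden     # 1 x 1 expansion conv
    depthwise = k * k * hidden  # depthwise conv on the expanded channels
    project = hidden * c_out   # 1 x 1 projection without a nonlinearity
    return expand + depthwise + project


if __name__ == "__main__":
    c_in, c_out = 64, 64  # hypothetical channel sizes
    print("standard 3x3 conv:       ", standard_conv_params(c_in, c_out))
    print("depthwise separable conv:", depthwise_separable_params(c_in, c_out))
    print("linear bottleneck (t=2): ", linear_bottleneck_params(c_in, c_out))
```

Under these assumed sizes, the depthwise separable layer needs roughly an eighth of the weights of the standard convolution, which illustrates why such layers are attractive when both on-device compute and the per-round communication cost of exchanging model updates in FL are constrained.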