Federated learning enables edge devices to collaboratively train a global model without exposing their data. Despite its advantages in computing efficiency and privacy protection, federated learning faces a significant challenge with non-IID data, i.e., client data that are typically not independent and identically distributed. In this paper, we tackle a new type of non-IID data, called cluster-skewed non-IID, discovered in real-world datasets. Cluster-skewed non-IID is a phenomenon in which clients can be grouped into clusters with similar data distributions. Through an in-depth analysis of the behavior of a classification model's penultimate layer, we introduce a metric that quantifies the similarity between two clients' data distributions without violating their privacy. We then propose an aggregation scheme that guarantees equality between clusters. In addition, we offer a novel local training regularization, based on knowledge distillation, that reduces overfitting at clients and substantially improves training performance. We theoretically prove the superiority of the proposed aggregation over the FedAvg baseline. Extensive experiments on both standard public datasets and our in-house real-world dataset demonstrate that the proposed approach improves accuracy by up to 16% compared to FedAvg.
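To make the contrast between FedAvg and an aggregation that "guarantees equality between clusters" concrete, the following is a minimal sketch and not the paper's actual scheme, whose formula the abstract does not give. It assumes cluster assignments are already known, and it assumes one plausible reading of cluster equality: every cluster receives the same total weight, split among its clients by sample count. The names `fedavg_weights`, `cluster_balanced_weights`, and `aggregate` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fedavg_weights(num_samples: Sequence[int]) -> List[float]:
    """Standard FedAvg: weight each client by its share of all samples."""
    total = sum(num_samples)
    return [n / total for n in num_samples]


def cluster_balanced_weights(
    num_samples: Sequence[int], cluster_ids: Sequence[int]
) -> List[float]:
    """Assumed cluster-equal weighting (illustrative, not from the paper):
    each of the K clusters gets total weight 1/K, divided among its
    clients in proportion to their sample counts."""
    cluster_totals: Dict[int, int] = defaultdict(int)
    for n, c in zip(num_samples, cluster_ids):
        cluster_totals[c] += n
    k = len(cluster_totals)
    return [n / cluster_totals[c] / k for n, c in zip(num_samples, cluster_ids)]


def aggregate(
    client_params: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of flat client parameter vectors."""
    dim = len(client_params[0])
    return [
        sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(dim)
    ]


# Toy setup: three clients in cluster 0, one client in cluster 1.
num_samples = [100, 100, 100, 100]
cluster_ids = [0, 0, 0, 1]
client_params = [[1.0], [1.0], [1.0], [5.0]]

w_fedavg = fedavg_weights(num_samples)
w_balanced = cluster_balanced_weights(num_samples, cluster_ids)
global_fedavg = aggregate(client_params, w_fedavg)
global_balanced = aggregate(client_params, w_balanced)
```

In this toy setup, FedAvg lets the three-client cluster dominate the global parameter, while the cluster-balanced weights place the result midway between the two clusters' values. That is the kind of imbalance a cluster-aware aggregation is meant to correct.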