Federated Learning (FL) is a distributed machine learning paradigm in which clients collaboratively train a model on their local (human-generated) datasets. Existing studies focus on developing FL algorithms that tackle data heterogeneity across clients, but the important issue of data quality (e.g., label noise) in FL has been overlooked. This paper aims to fill this gap with a quantitative study of the impact of label noise on FL. We derive an upper bound on the generalization error that is linear in the clients' label noise level. We then conduct experiments on the MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy decreases linearly as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows the convergence of FL training and that the global model tends to overfit when the noise level is high.
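To make the notion of a client's "label noise level" concrete, the following is a minimal sketch of how noisy client labels could be simulated. It assumes symmetric (uniform) label flipping, which the abstract does not specify; the function name `add_symmetric_label_noise`, the per-client noise levels, and the 10-class setup are illustrative assumptions, not the paper's protocol.

```python
import random


def add_symmetric_label_noise(labels, noise_level, num_classes, rng):
    """Return a copy of `labels` where each label is flipped with
    probability `noise_level` to a different class chosen uniformly.

    Assumption: symmetric noise; the paper's actual noise model may differ.
    """
    noisy = []
    for y in labels:
        if rng.random() < noise_level:
            # An offset in [1, num_classes - 1] guarantees the new label differs.
            y = (y + rng.randrange(1, num_classes)) % num_classes
        noisy.append(y)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)
    num_classes = 10  # e.g. MNIST or CIFAR-10

    # Hypothetical setup: three clients, each with its own noise level.
    client_noise_levels = [0.0, 0.2, 0.4]
    for client_id, eps in enumerate(client_noise_levels):
        clean = [rng.randrange(num_classes) for _ in range(2000)]
        noisy = add_symmetric_label_noise(clean, eps, num_classes, rng)
        flipped = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
        print(f"client {client_id}: target noise {eps:.1f}, observed {flipped:.3f}")
```

Because every flip changes the class, the observed fraction of corrupted labels on each client approximates its configured noise level, which is the quantity the generalization bound is stated in terms of.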