Recent advances in machine learning have highlighted Federated Learning (FL) as a promising approach that enables multiple distributed users (so-called clients) to collectively train ML models without sharing their private data. While this privacy-preserving method shows potential, it struggles when data across clients is not independent and identically distributed (non-IID). Non-IID data remains an unsolved challenge that can result in poorer model performance and slower training. Despite the significance of non-IID data in FL, there is a lack of consensus among researchers about its classification and quantification. This technical survey aims to fill that gap by providing a detailed taxonomy of non-IID data, partition protocols, and metrics for quantifying data heterogeneity. Additionally, we describe popular solutions for addressing non-IID data and standardized frameworks employed in FL with heterogeneous data. Based on our state-of-the-art survey, we present key lessons learned and suggest promising future research directions.
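To make the notion of a partition protocol concrete: non-IID benchmarks are commonly created by splitting a centralized dataset across clients with label proportions drawn from a Dirichlet distribution, where a smaller concentration parameter yields stronger heterogeneity. The following is a minimal illustrative sketch of such a label-skew split; the function name and parameters are our own, not from any specific framework.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients with label skew drawn from a
    Dirichlet(alpha) distribution; smaller alpha -> more heterogeneity.
    Illustrative sketch of a common non-IID partition protocol."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Fraction of class-c samples that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Example: 1000 samples over 10 classes, split across 5 clients.
labels = np.random.default_rng(1).integers(0, 10, size=1000)
parts = dirichlet_partition(labels, n_clients=5, alpha=0.5)
```

Every sample is assigned to exactly one client, so the union of the client index lists recovers the full dataset while per-client label histograms diverge.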