From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

Federated Learning (FL) has emerged as a promising solution to the data hunger of centralized learning. The paradigm lets multiple clients collaboratively train a model for a shared task without exposing their local data, thereby preserving privacy. Data is a key component of any learning system, but it is also a primary source of vulnerabilities and challenges and a major determinant of whether training is stable and converges well. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, but they do not examine the diverse aspects of data or consider problems from the data perspective. They rarely provide a data-centric synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three contributions. First, we decompose non-IID (non-independent and identically distributed) data into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanism behind each and reconciling evidence across images, text, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the convergence-robustness trade-off explicit. To our knowledge, this is the first survey to provide a comprehensive account of the data-related challenges that govern FL. With clear takeaways distilled for each concern, our work serves as actionable guidance that helps practitioners design systems with predictable convergence and stability.
