Federated learning (FL) enables collaborative model training without sharing raw user data. However, conventional FL simulations often rely on unrealistic data partitioning, and existing user selection methods ignore data correlation among users. To address these challenges, this paper proposes a metadata-driven FL framework. We first introduce a novel data partitioning model based on a homogeneous Poisson point process (HPPP), which captures both heterogeneity in data quantity and the natural overlap among user datasets. Building on this model, we develop a clustering-based user selection strategy that leverages metadata, such as user location, to reduce data correlation among selected users and to enhance label diversity across training rounds. Extensive experiments on FMNIST and CIFAR-10 demonstrate that the proposed framework improves model performance, stability, and convergence in non-IID scenarios while maintaining comparable performance under IID settings. The advantage is most pronounced when the number of users selected per round is small. These findings highlight the framework's potential for enhancing FL performance in realistic deployments and for guiding future standardization.
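To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that users and data samples are both drawn as HPPPs on a unit square, that each user holds every sample within a fixed radius of its location (so nearby users naturally share data and per-user quantities vary), and that selection picks one random user per k-means cluster of user locations. The function names, the coverage-radius rule, and the plain Lloyd-style k-means are all assumptions introduced here for illustration.

```python
import numpy as np


def sample_hppp(intensity, rng, side=1.0):
    """Sample a homogeneous Poisson point process on the square [0, side]^2."""
    # The number of points is Poisson with mean intensity * area;
    # given the count, points are uniform over the region.
    n = rng.poisson(intensity * side * side)
    return rng.uniform(0.0, side, size=(n, 2))


def partition_by_coverage(users, samples, radius):
    """Hypothetical rule: a user holds all samples within `radius` of it."""
    # Pairwise distances, shape (num_users, num_samples).
    dist = np.linalg.norm(users[:, None, :] - samples[None, :, :], axis=2)
    return [np.flatnonzero(row <= radius) for row in dist]


def kmeans_labels(points, k, rng, iters=20):
    """Plain Lloyd iterations; returns a cluster label for each point."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels


def select_users(users, k, rng):
    """Assumed strategy: pick one random user from each location cluster."""
    k = min(k, len(users))
    if k == 0:
        return []
    labels = kmeans_labels(users, k, rng)
    chosen = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members) > 0:
            chosen.append(int(rng.choice(members)))
    return chosen


rng = np.random.default_rng(0)
users = sample_hppp(intensity=50, rng=rng)      # user locations (metadata)
samples = sample_hppp(intensity=2000, rng=rng)  # data-sample locations
parts = partition_by_coverage(users, samples, radius=0.15)
selected = select_users(users, k=5, rng=rng)
```

Under these assumptions, users whose coverage disks intersect hold overlapping datasets, and selecting users from different spatial clusters tends to reduce that overlap within a training round.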