Federated Learning (FL) is increasingly used in domains such as healthcare to enable collaborative model training without sharing data. However, datasets held at different sites are often not identically distributed, which degrades model performance in FL. Most existing methods for assessing these distribution shifts are dataset- or task-specific. Moreover, these metrics can only be calculated by exchanging data, a practice restricted in many FL scenarios. To address these challenges, we propose a novel metric for assessing dataset similarity. Our metric exhibits several desirable properties for FL: it is dataset-agnostic, is calculated in a privacy-preserving manner, and is computationally efficient, requiring no model training. In this paper, we first establish a theoretical connection between our metric and training dynamics in FL. Next, we extensively evaluate our metric on a range of datasets, including synthetic, benchmark, and medical imaging datasets. We demonstrate that our metric shows a robust and interpretable relationship with model performance and can be calculated in a privacy-preserving manner. We believe that this metric, as the first federated dataset similarity metric, can better facilitate successful collaborations between sites.