A Universal Metric of Dataset Similarity for Cross-silo Federated Learning

Federated Learning (FL) is increasingly used in domains such as healthcare to facilitate collaborative model training without data sharing. However, datasets located at different sites are often non-identically distributed, which degrades model performance in FL. Most existing methods for assessing these distribution shifts are dataset- or task-specific. Moreover, these metrics can only be calculated by exchanging data, a practice restricted in many FL scenarios. To address these challenges, we propose a novel metric for assessing dataset similarity. Our metric exhibits several desirable properties for FL: it is dataset-agnostic, is calculated in a privacy-preserving manner, and is computationally efficient, requiring no model training. In this paper, we first establish a theoretical connection between our metric and training dynamics in FL. Next, we extensively evaluate our metric on a range of datasets, including synthetic, benchmark, and medical imaging datasets. We demonstrate that our metric shows a robust and interpretable relationship with model performance and can be calculated in a privacy-preserving manner. We believe that this metric, as the first federated dataset similarity metric, can better facilitate successful collaborations between sites.

