On the Impact of Data Heterogeneity in Federated Learning Environments with Application to Healthcare Networks

Federated Learning (FL) allows multiple privacy-sensitive applications to leverage their datasets to build a global model without disclosing the underlying data. Healthcare is one such domain: groups of data silos collaborate to train a global predictor with improved accuracy and generalization. The inherent challenge, however, is the high heterogeneity of medical data, which calls for dedicated techniques to assess and compensate for it. This paper presents the mathematical formalization and taxonomy of heterogeneity in FL environments, with a focus on medical data. In particular, we evaluate and compare the most popular FL algorithms with respect to their ability to cope with quantity-based, feature distribution-based, and label distribution-based heterogeneity. The goal is to provide a quantitative evaluation of the impact of data heterogeneity on FL systems for healthcare networks, together with a guideline for FL algorithm selection. Our research extends existing studies by benchmarking seven of the most common FL algorithms on medical data use cases. The target task is predicting the risk of stroke recurrence from tabular clinical reports collected by different federated hospital silos; we discuss the data heterogeneity frequently encountered in this scenario and its impact on FL performance.
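To make label distribution-based heterogeneity concrete, the following is a minimal sketch of one common way to simulate it across silos: each class is split among silos according to Dirichlet-sampled proportions, and the skew of each silo is then measured as the total variation distance between its label distribution and the global one. The abstract does not state how the paper generates or measures heterogeneity, so the Dirichlet partitioning, the distance metric, the function names, and the synthetic 80/20 binary labels are all illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter


def dirichlet(alpha, k, rng):
    """Sample a symmetric Dirichlet(alpha) vector of length k via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_label_skew(labels, n_silos, alpha, seed=0):
    """Split sample indices across silos, class by class, with Dirichlet(alpha) shares.

    Smaller alpha tends to give more skewed silos (hypothetical helper, not from the paper).
    """
    rng = random.Random(seed)
    silos = [[] for _ in range(n_silos)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        shares = dirichlet(alpha, n_silos, rng)
        start, cum = 0, 0.0
        for s in range(n_silos):
            cum += shares[s]
            # The last silo takes the remainder so every index is assigned exactly once.
            end = len(idxs) if s == n_silos - 1 else round(cum * len(idxs))
            silos[s].extend(idxs[start:end])
            start = end
    return silos


def label_distribution(labels, idxs, classes):
    """Relative frequency of each class among the given sample indices."""
    counts = Counter(labels[i] for i in idxs)
    n = max(len(idxs), 1)  # avoid division by zero for an empty silo
    return [counts[c] / n for c in classes]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    # Synthetic, imbalanced binary outcome (assumed data, for illustration only).
    labels = [0] * 800 + [1] * 200
    classes = [0, 1]
    global_dist = label_distribution(labels, range(len(labels)), classes)
    for alpha in (100.0, 0.5):
        silos = partition_label_skew(labels, n_silos=4, alpha=alpha, seed=1)
        skew = [
            round(total_variation(label_distribution(labels, s, classes), global_dist), 3)
            for s in silos
        ]
        print(f"alpha={alpha}: silo sizes={[len(s) for s in silos]}, TV to global={skew}")
```

Because the per-class shares are drawn independently, this sketch also produces unequal silo sizes, so it mixes quantity-based and label distribution-based heterogeneity; isolating one from the other would need a different partitioning scheme.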

