Federated Learning (FL) is a novel approach enabling several clients holding sensitive data to collaboratively train machine learning models without centralizing data. The cross-silo FL setting corresponds to the case of a few ($2$--$50$) reliable clients, each holding medium to large datasets, and is typically found in applications such as healthcare, finance, or industry. While previous works have proposed representative datasets for cross-device FL, few realistic healthcare cross-silo FL datasets exist, thereby slowing algorithmic research in this critical application. In this work, we propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap between theory and practice of cross-silo FL. FLamby encompasses 7 healthcare datasets with natural splits, covering multiple tasks, modalities, and data volumes, each accompanied by baseline training code. As an illustration, we additionally benchmark standard FL algorithms on all datasets. Our flexible and modular suite allows researchers to easily download datasets, reproduce results, and reuse the different components for their research. FLamby is available at~\url{www.github.com/owkin/flamby}.