Towards the Practical Utility of Federated Learning in the Medical Domain

Federated learning (FL) is an active area of research. One of the most suitable areas for adopting FL is the medical domain, where patient privacy must be respected. Previous research, however, does not provide a practical guide to applying FL in the medical domain. We propose empirical benchmarks and experimental settings for three representative medical datasets with different modalities: longitudinal electronic health records, skin cancer images, and electrocardiogram signals. Likely users of FL, such as medical institutions and IT companies, can use these benchmarks as guides for adopting FL and reduce trial and error. For each dataset, each client's data comes from a different source, which preserves real-world heterogeneity. We evaluate six FL algorithms designed to address data heterogeneity among clients, as well as a hybrid algorithm that combines the strengths of two representative FL algorithms. Based on experimental results from the three modalities, we find that simple FL algorithms tend to outperform more sophisticated ones, while the hybrid algorithm consistently shows good, if not the best, performance. We also find that more frequent global model updates lead to better performance under a fixed training iteration budget. As the number of participating clients increases, costs rise because more IT administrators and GPUs are needed, but performance consistently improves. We expect that future users will refer to these empirical benchmarks when designing FL experiments in the medical domain for their own clinical tasks, and that doing so will help them obtain stronger performance at lower cost.
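
To make the fixed-budget finding concrete, the following is a minimal sketch, not the benchmark code. It assumes a FedAvg-style weighted average as the aggregation rule (the abstract does not name the algorithms evaluated) and a toy one-parameter regression in place of a medical model. All names (`local_sgd`, `aggregate`, `run_fl`, `make_client`) and the client slopes are hypothetical and chosen only for illustration.

```python
import random


def local_sgd(w, data, steps, lr):
    """Run `steps` SGD updates of the toy model y = w * x on one client's data."""
    for i in range(steps):
        x, y = data[i % len(data)]       # cycle through the client's samples
        w -= lr * 2.0 * (w * x - y) * x  # gradient step on the squared error
    return w


def aggregate(weights, sizes):
    """FedAvg-style average of client weights, weighted by client sample counts."""
    return sum(w * n for w, n in zip(weights, sizes)) / sum(sizes)


def run_fl(clients, total_local_steps, local_steps_per_round, lr=0.1):
    """Train under a fixed per-client step budget; return (global weight, rounds)."""
    rounds = total_local_steps // local_steps_per_round
    sizes = [len(d) for d in clients]
    w_global = 0.0
    for _ in range(rounds):
        # Each client starts from the current global weight and trains locally.
        local = [local_sgd(w_global, d, local_steps_per_round, lr) for d in clients]
        w_global = aggregate(local, sizes)  # one global model update
    return w_global, rounds


def make_client(slope, n, seed):
    """Hypothetical client: n noise-free samples of y = slope * x."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x) for x in xs]


if __name__ == "__main__":
    # Different slopes per client stand in for heterogeneous data sources.
    clients = [make_client(1.0, 20, 0), make_client(2.0, 40, 1), make_client(3.0, 60, 2)]
    for k in (100, 10, 1):
        w, rounds = run_fl(clients, total_local_steps=100, local_steps_per_round=k)
        print(f"local steps per round={k:3d}  global updates={rounds:3d}  w={w:.3f}")
```

Holding `total_local_steps` fixed and lowering `local_steps_per_round` increases the number of global updates, which is the trade-off the abstract refers to. The sketch only exposes that knob; it makes no claim about which setting performs best on real medical data.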
