Towards the Practical Utility of Federated Learning in the Medical Domain

Federated learning (FL) is an active area of research. The medical domain, where patient privacy must be protected, is one of the most suitable areas for adopting FL. Previous research, however, does not provide a practical guide to applying FL in this domain. We propose empirical benchmarks and experimental settings for three representative medical datasets with different modalities: longitudinal electronic health records, skin cancer images, and electrocardiogram signals. Likely users of FL, such as medical institutions and IT companies, can use these benchmarks as guides for adopting FL and minimize trial and error. For each dataset, each client's data comes from a different source to preserve real-world heterogeneity. We evaluate six FL algorithms designed to address data heterogeneity among clients, and a hybrid algorithm that combines the strengths of two representative FL algorithms. Based on experimental results across the three modalities, we find that simple FL algorithms tend to outperform more sophisticated ones, while the hybrid algorithm consistently shows good, if not the best, performance. We also find that more frequent global model updates lead to better performance under a fixed training iteration budget. As the number of participating clients increases, costs rise because more IT administrators and GPUs are required, but performance consistently improves. We expect future users to refer to these empirical benchmarks when designing FL experiments in the medical domain for their clinical tasks, obtaining stronger performance at lower cost.
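The trade-off between update frequency and a fixed training iteration budget can be illustrated with a toy sketch. It assumes FedAvg-style weighted averaging as the aggregation rule (the abstract does not name the algorithms evaluated, so this choice is an assumption) and replaces real medical models with hypothetical one-dimensional quadratic client losses. The names `Client`, `fedavg`, and `global_optimum` are illustrative only. The sketch shows the mechanism of client drift under heterogeneous data. It does not reproduce the benchmark results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Client:
    """Hypothetical client whose loss is 0.5 * curvature * (w - optimum) ** 2."""
    curvature: float
    optimum: float
    num_samples: int

    def local_train(self, w: float, steps: int, lr: float) -> float:
        # Plain gradient descent on this client's own loss only.
        for _ in range(steps):
            grad = self.curvature * (w - self.optimum)
            w -= lr * grad
        return w


def fedavg(clients: List[Client], total_iters: int, local_steps: int,
           lr: float = 0.1, w0: float = 0.0) -> float:
    """Run FedAvg-style training under a fixed budget of local iterations.

    The budget is split into total_iters // local_steps communication rounds,
    so fewer local steps per round means more frequent global updates.
    """
    if local_steps <= 0 or total_iters % local_steps != 0:
        raise ValueError("local_steps must be positive and divide total_iters")
    rounds = total_iters // local_steps
    total_samples = sum(c.num_samples for c in clients)
    w = w0
    for _ in range(rounds):
        # Each client starts from the current global model.
        local_models = [c.local_train(w, local_steps, lr) for c in clients]
        # The server averages client models, weighted by sample count.
        w = sum(c.num_samples * wk
                for c, wk in zip(clients, local_models)) / total_samples
    return w


def global_optimum(clients: List[Client]) -> float:
    """Minimizer of the sample-weighted sum of the client losses."""
    num = sum(c.num_samples * c.curvature * c.optimum for c in clients)
    den = sum(c.num_samples * c.curvature for c in clients)
    return num / den


if __name__ == "__main__":
    # Two heterogeneous clients: different curvatures and different optima.
    clients = [Client(1.0, 0.0, 100), Client(4.0, 1.0, 100)]
    w_star = global_optimum(clients)
    for local_steps in (1, 10, 50):
        w = fedavg(clients, total_iters=200, local_steps=local_steps)
        print(f"local_steps={local_steps:2d}  |w - w*| = {abs(w - w_star):.4f}")
```

In this toy setting, averaging after every local step keeps the global model on the gradient of the combined objective, whereas long local runs pull each client toward its own optimum before averaging. Real models, optimizers, and datasets may behave differently, so the sketch is only a way to reason about the setting, not evidence for it.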
