Scaling Survival Analysis in Healthcare with Federated Survival Forests: A Comparative Study on Heart Failure and Breast Cancer Genomics

Survival analysis is a fundamental tool in medicine, modeling the time until an event of interest occurs in a population. In real-world applications, however, survival data are often incomplete, censored, distributed across institutions, and confidential, especially in healthcare settings where privacy is critical. The resulting scarcity of data at any single site can severely limit the scalability of survival models to distributed applications that rely on large data pools. Federated learning is a promising technique that enables machine learning models to be trained on multiple datasets without compromising user privacy, making it well suited to the challenges of survival data and large-scale survival applications. Despite significant developments in federated learning for classification and regression, many directions remain unexplored in the context of survival analysis. In this work, we propose FedSurF++, an extension of the Federated Survival Forest (FedSurF) algorithm. This federated ensemble method constructs random survival forests in heterogeneous federations. Specifically, we investigate several new methods for sampling trees from client forests and compare the results with state-of-the-art survival models based on neural networks. The key advantage of FedSurF++ is that it achieves performance comparable to existing methods while requiring only a single communication round. Our extensive empirical investigation yields significant improvements from both the algorithmic and the privacy-preservation perspectives, making the original FedSurF algorithm more efficient, robust, and private. We also present results on two real-world datasets, on heart failure and breast cancer genomics, that demonstrate the success of FedSurF++ in practical healthcare studies. Our results underscore the potential of FedSurF++ to improve the scalability and effectiveness of survival analysis in distributed settings while preserving user privacy.
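
To make the single-round idea concrete, the following is a minimal sketch of how a server might assemble a global ensemble by sampling trees from client forests. It is an illustration under stated assumptions, not the FedSurF++ implementation: the abstract does not specify the sampling rules, so the sketch assumes that clients are chosen with probability proportional to their local sample count and that trees are drawn uniformly with replacement. The names `sample_trees`, `predict_risk`, `client_forests`, and `client_sizes` are hypothetical, and a "tree" is modeled as any callable that maps a feature vector to a risk score.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in: a "tree" is any callable from features to a risk score.
Tree = Callable[[List[float]], float]


def sample_trees(
    client_forests: Dict[str, List[Tree]],
    client_sizes: Dict[str, int],
    n_global_trees: int,
    seed: int = 0,
) -> List[Tree]:
    """Build a global ensemble from locally trained forests in one round.

    Assumption (not taken from the abstract): a client is picked with
    probability proportional to its local sample count, then one of its
    trees is drawn uniformly, with replacement.
    """
    rng = random.Random(seed)
    clients = sorted(client_forests)  # fixed order for reproducibility
    weights = [client_sizes[c] for c in clients]
    ensemble: List[Tree] = []
    for _ in range(n_global_trees):
        chosen = rng.choices(clients, weights=weights, k=1)[0]
        ensemble.append(rng.choice(client_forests[chosen]))
    return ensemble


def predict_risk(ensemble: List[Tree], x: List[float]) -> float:
    """Average the per-tree risk scores, as in a standard forest ensemble."""
    return sum(tree(x) for tree in ensemble) / len(ensemble)
```

In this sketch each client sends its forest to the server once and no further exchange is needed, which is what a single communication round amounts to. Other sampling rules, for example ones that weight trees by a local validation metric, could replace the two random draws without changing the rest of the code.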
