Survival analysis is a subfield of statistics concerned with modeling the time until a particular event of interest occurs in a population. Survival analysis has found widespread applications in healthcare, engineering, and the social sciences. However, real-world applications involve survival datasets that are distributed, incomplete, censored, and confidential. In this context, federated learning can substantially improve the performance of survival analysis applications. Federated learning provides a set of privacy-preserving techniques for jointly training machine learning models on multiple datasets without compromising user privacy, leading to better generalization performance. Despite the widespread development of federated learning in recent AI research, only a few studies focus on federated survival analysis. In this work, we present a novel federated algorithm for survival analysis based on one of the most successful survival models, the random survival forest. We call the proposed method Federated Survival Forest (FedSurF). With a single communication round, FedSurF obtains discriminative power comparable to that of deep-learning-based federated models trained over hundreds of federated iterations. Moreover, FedSurF retains all the advantages of random forests, namely low computational cost and natural handling of missing values and incomplete datasets. These advantages are especially desirable in real-world federated environments with multiple small datasets stored on devices with low computational capabilities. Numerical experiments compare FedSurF with state-of-the-art survival models in federated networks, showing that FedSurF outperforms deep-learning-based federated algorithms in realistic environments with non-identically distributed data.
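The single-round idea can be illustrated with a minimal Python sketch under stated assumptions. The abstract does not specify how local trees are grown or how the server combines them, so everything here is hypothetical: the names `fit_stump`, `train_local_forest`, `build_federated_forest`, and `predict_risk` are illustrative, the one-split stump with an events-per-unit-time leaf estimate is a simplified stand-in for a full survival tree, and sampling trees in proportion to client dataset size is an assumed aggregation rule, not necessarily the one FedSurF uses.

```python
import random

# A record is (features, observed_time, event), with event = 1 if the event
# was observed and 0 if the record is censored.


def fit_stump(sample, rng):
    """Hypothetical stand-in for a survival tree: one median split on a random
    feature, with an events-per-unit-time hazard estimate in each leaf."""
    feature = rng.randrange(len(sample[0][0]))
    values = sorted(x[feature] for x, _, _ in sample)
    threshold = values[len(values) // 2]

    def hazard(rows):
        total_time = sum(t for _, t, _ in rows)
        events = sum(e for _, _, e in rows)
        return events / total_time if total_time > 0 else 0.0

    left = hazard([r for r in sample if r[0][feature] < threshold])
    right = hazard([r for r in sample if r[0][feature] >= threshold])
    return lambda x: left if x[feature] < threshold else right


def train_local_forest(data, n_trees, rng):
    """Client side: fit each stump on a bootstrap resample of the local data."""
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(data) for _ in data]
        forest.append(fit_stump(bootstrap, rng))
    return forest


def build_federated_forest(client_forests, client_sizes, n_trees, rng):
    """Server side, one round: draw trees (with replacement) from the clients,
    picking clients in proportion to dataset size (an assumed rule)."""
    global_forest = []
    for _ in range(n_trees):
        k = rng.choices(range(len(client_forests)), weights=client_sizes)[0]
        global_forest.append(rng.choice(client_forests[k]))
    return global_forest


def predict_risk(forest, x):
    """Ensemble prediction: average the risk scores of all trees."""
    return sum(tree(x) for tree in forest) / len(forest)


def make_client_data(n, rng, censor_time=2.0):
    """Synthetic data: the hazard grows with feature 0; feature 1 is noise."""
    rows = []
    for _ in range(n):
        x = [rng.random(), rng.random()]
        t = rng.expovariate(0.2 + 3.0 * x[0])
        rows.append((x, min(t, censor_time), 1 if t <= censor_time else 0))
    return rows


rng = random.Random(0)
client_sizes = [50, 120, 300]  # clients hold datasets of different sizes
client_data = [make_client_data(n, rng) for n in client_sizes]

# Each client trains locally and sends its trees to the server once.
client_forests = [train_local_forest(d, n_trees=10, rng=rng) for d in client_data]

# The server assembles the final forest without any further communication.
global_forest = build_federated_forest(client_forests, client_sizes, n_trees=20, rng=rng)

high = predict_risk(global_forest, [0.9, 0.5])
low = predict_risk(global_forest, [0.1, 0.5])
print("risk at x0=0.9:", high, "risk at x0=0.1:", low)
```

Only fitted trees leave the clients in this sketch, and only once; the raw records stay local. A real deployment would replace the stump with a proper survival tree and would need to consider what the transmitted trees themselves reveal about the local data.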