Survival analysis is a subfield of statistics concerned with modeling the time until an event of interest occurs in a population. It has found widespread application in healthcare, engineering, and the social sciences. However, real-world survival datasets are often distributed, incomplete, censored, and confidential. In this context, federated learning can substantially improve the performance of survival analysis applications. Federated learning provides a set of privacy-preserving techniques for jointly training machine learning models on multiple datasets without compromising user privacy, which leads to better generalization performance. Despite the rapid development of federated learning in recent AI research, few studies focus on federated survival analysis. In this work, we present a novel federated algorithm for survival analysis based on one of the most successful survival models, the random survival forest. We call the proposed method Federated Survival Forest (FedSurF). With a single communication round, FedSurF achieves discriminative power comparable to that of deep-learning-based federated models trained over hundreds of federated iterations. Moreover, FedSurF retains the advantages of random forests, namely low computational cost and natural handling of missing values and incomplete datasets. These advantages are especially desirable in real-world federated environments, where multiple small datasets are stored on devices with low computational capabilities. Numerical experiments compare FedSurF with state-of-the-art survival models in federated networks and show that FedSurF outperforms deep-learning-based federated algorithms in realistic environments with non-identically distributed data.
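To make the single-communication-round idea concrete, the following is a minimal sketch and not the FedSurF algorithm itself. It assumes each client fits a few local learners on bootstrap samples of its own censored data and sends only those learners to the server, which averages their survival predictions. For brevity, each learner here is a Kaplan-Meier estimator standing in for a survival tree (a real tree would also split on covariates). The function names, the "keep every learner" aggregation rule, and the toy data are all illustrative assumptions.

```python
import random
from bisect import bisect_right


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: event times and the survival value after each."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv = 1.0
    step_times, step_values = [], []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same observed time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]  # event flag: 1 = event, 0 = censored
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            step_times.append(t)
            step_values.append(surv)
        n_at_risk -= removed
    return step_times, step_values


def predict(learner, t):
    """Survival probability at time t for one fitted learner."""
    step_times, step_values = learner
    k = bisect_right(step_times, t)
    return 1.0 if k == 0 else step_values[k - 1]


def train_local_learners(times, events, n_learners, rng):
    """Client side: fit learners on bootstrap samples of the local data."""
    n = len(times)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        learners.append(
            kaplan_meier([times[j] for j in idx], [events[j] for j in idx])
        )
    return learners


def federate_one_round(client_data, n_learners_per_client, seed=0):
    """Simulated single round: only fitted learners are pooled, never raw data.

    Assumption: the server keeps every learner it receives.
    """
    rng = random.Random(seed)
    forest = []
    for times, events in client_data:
        forest.extend(
            train_local_learners(times, events, n_learners_per_client, rng)
        )
    return forest


def forest_survival(forest, t):
    """Server side: average the survival predictions of all pooled learners."""
    return sum(predict(learner, t) for learner in forest) / len(forest)


# Hypothetical toy data: (observed times, event indicators) per client.
clients = [
    ([2, 3, 5, 7, 8], [1, 0, 1, 1, 0]),
    ([1, 4, 4, 6, 9], [1, 1, 0, 1, 1]),
]
forest = federate_one_round(clients, n_learners_per_client=3)
curve = [forest_survival(forest, t) for t in range(0, 11)]
```

In this sketch the only thing that crosses the client boundary is the list of fitted learners, which is why one round suffices. How many learners to send and how the server selects among them are design choices that the sketch leaves at their simplest setting.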