Survival analysis studies techniques for modeling the time until an event of interest occurs in a population. It has found widespread applications in healthcare, engineering, and the social sciences. However, the data needed to train survival models are often distributed, incomplete, censored, and confidential. In this context, federated learning can be exploited to substantially improve the quality of models trained on distributed data while preserving user privacy. Federated survival analysis is still in its early development, though, and there is no common benchmark dataset for testing federated survival models. This work provides a novel technique for constructing realistic heterogeneous datasets from existing non-federated datasets in a reproducible way. Specifically, we propose two dataset-splitting algorithms based on the Dirichlet distribution that assign each data sample to a carefully chosen client: quantity-skewed splitting and label-skewed splitting. These algorithms yield different levels of heterogeneity by changing a single hyperparameter. Finally, numerical experiments provide a quantitative evaluation of the heterogeneity level using log-rank tests and a qualitative analysis of the generated splits. The implementation of the proposed methods is publicly available to support reproducibility and to encourage common practices for simulating federated environments for survival analysis.
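The two splitting strategies named above can be illustrated with a minimal sketch. It is not the authors' released implementation: the function names, the `alpha` and `n_bins` parameters, and the choice to treat quantile bins of the event times as the "labels" are all assumptions made here for illustration, since the abstract does not spell out these details.

```python
import numpy as np


def quantity_skewed_split(n_samples, n_clients, alpha, seed=0):
    """Assign each sample to a client so that client sizes are skewed.

    Hypothetical helper: client proportions are drawn once from a
    symmetric Dirichlet(alpha) and every sample picks a client from them.
    """
    rng = np.random.default_rng(seed)
    proportions = rng.dirichlet(alpha * np.ones(n_clients))
    return rng.choice(n_clients, size=n_samples, p=proportions)


def label_skewed_split(times, n_clients, alpha, n_bins=4, seed=0):
    """Assign samples to clients so that time distributions differ.

    Assumption: event/censoring times are discretized into quantile bins,
    and each bin gets its own Dirichlet(alpha) distribution over clients.
    """
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    # Interior quantile edges; np.digitize then yields bin ids 0..n_bins-1.
    edges = np.quantile(times, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(times, edges)
    clients = np.empty(times.shape[0], dtype=int)
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size == 0:
            continue  # no samples fell into this bin
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        clients[idx] = rng.choice(n_clients, size=idx.size, p=proportions)
    return clients


if __name__ == "__main__":
    sizes = np.bincount(quantity_skewed_split(1000, 5, alpha=0.5), minlength=5)
    print("samples per client:", sizes)
```

In this sketch, `alpha` plays the role of the single heterogeneity hyperparameter: small values concentrate the Dirichlet draws on a few clients and produce strongly skewed splits, while large values push the proportions toward uniform and the split toward a homogeneous one.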