A Differentially Private Kaplan-Meier Estimator for Privacy-Preserving Survival Analysis

This paper presents a differentially private approach to Kaplan-Meier estimation that achieves accurate survival probability estimates while safeguarding individual privacy. The Kaplan-Meier estimator is widely used in survival analysis to estimate survival functions over time, yet applying it to sensitive datasets, such as clinical records, risks revealing private information. To address this, we introduce a novel algorithm that applies time-indexed Laplace noise, dynamic clipping, and smoothing to produce a privacy-preserving survival curve while maintaining the cumulative structure of the Kaplan-Meier estimator. By scaling noise over time, the algorithm accounts for decreasing sensitivity as fewer individuals remain at risk, while dynamic clipping and smoothing prevent extreme values and reduce fluctuations, preserving the natural shape of the survival curve. Our results, evaluated on the NCCTG lung cancer dataset, show that the proposed method effectively lowers root mean squared error (RMSE) and enhances accuracy across privacy budgets ($\epsilon$). At $\epsilon = 10$, the algorithm achieves an RMSE as low as 0.04, closely approximating non-private estimates. Additionally, membership inference attacks reveal that higher $\epsilon$ values (e.g., $\epsilon \geq 6$) significantly reduce influential points, particularly at higher thresholds, lowering susceptibility to inference attacks. These findings confirm that our approach balances privacy and utility, advancing privacy-preserving survival analysis.
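The pipeline described above (Kaplan-Meier estimate, time-indexed Laplace noise, dynamic clipping, smoothing) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the noise schedule, the clipping rule, or the smoother. The function names, the `1 / (epsilon * (i + 1))` noise scale, the clip-to-previous-value rule, and the moving-average window are all hypothetical choices made for illustration, and this schedule is not claimed to be calibrated for a formal $\epsilon$-differential-privacy guarantee.

```python
import numpy as np


def kaplan_meier(times, events):
    """Non-private Kaplan-Meier estimate at each distinct event time."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)              # subjects still under observation
        deaths = np.sum((times == t) & events)    # events observed exactly at t
        s *= 1.0 - deaths / at_risk               # cumulative product structure
        surv.append(s)
    return event_times, np.array(surv)


def private_km_sketch(times, events, epsilon, window=3, seed=None):
    """Illustrative noisy survival curve; all tuning choices are assumptions."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    rng = np.random.default_rng(seed)
    t, surv = kaplan_meier(times, events)
    k = len(t)
    if k == 0:
        return t, surv

    # ASSUMPTION: hypothetical time-indexed schedule, scale shrinks with index i.
    scales = 1.0 / (epsilon * (np.arange(k) + 1))
    noisy = surv + rng.laplace(loc=0.0, scale=scales)

    # ASSUMPTION: "dynamic clipping" taken to mean clipping each value to
    # [0, previously released value], which keeps the curve non-increasing.
    clipped = np.empty(k)
    prev = 1.0
    for i in range(k):
        clipped[i] = min(max(noisy[i], 0.0), prev)
        prev = clipped[i]

    # ASSUMPTION: smoothing is a centered moving average with edge padding.
    half = window // 2
    padded = np.pad(clipped, half, mode="edge")
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    return t, np.clip(smoothed, 0.0, 1.0)


def rmse(a, b):
    """Root mean squared error between two curves on the same time grid."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


if __name__ == "__main__":
    # Synthetic data for illustration only (not the NCCTG lung cancer dataset).
    rng = np.random.default_rng(0)
    times = rng.exponential(scale=300.0, size=200).round()
    events = rng.random(200) < 0.7
    _, baseline = kaplan_meier(times, events)
    for eps in (1.0, 6.0, 10.0):
        _, private = private_km_sketch(times, events, epsilon=eps, seed=1)
        print(f"epsilon={eps}: RMSE vs. non-private curve = {rmse(baseline, private):.4f}")
```

Clipping each noisy value to the previously released one is one simple way to keep the output a valid, non-increasing survival curve, and a centered moving average preserves that ordering. Other clipping rules or smoothers would fit the abstract's description equally well.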
