Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where faithfully reproducing both the event-time distribution and the censoring mechanism is crucial for clinical research. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. Across multiple medical datasets, we show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and survival model evaluation metrics. To the best of our knowledge, SurvDiff is the first end-to-end diffusion model explicitly designed for generating synthetic survival data.
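To make the notion of right-censoring concrete, the following is a minimal illustrative sketch of what a survival record looks like and how censoring enters a standard estimate. It is not SurvDiff and not taken from the paper: the exponential event and censoring distributions, the rate values, and the function names `simulate_right_censored` and `kaplan_meier` are all assumptions chosen for illustration. Each record stores only the observed time min(T, C) and an indicator of whether the event was seen, which is the structure a synthetic generator would need to reproduce.

```python
import random


def simulate_right_censored(n, event_rate=0.1, censor_rate=0.05, seed=0):
    """Return (observed_time, event_indicator) pairs under independent right-censoring.

    Exponential event and censoring times are an illustrative assumption.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        event_time = rng.expovariate(event_rate)    # latent time-to-event T
        censor_time = rng.expovariate(censor_rate)  # latent censoring time C
        observed = min(event_time, censor_time)     # only min(T, C) is recorded
        event = int(event_time <= censor_time)      # 1 = event observed, 0 = censored
        records.append((observed, event))
    return records


def kaplan_meier(records):
    """Kaplan-Meier estimate as (time, survival probability) at each distinct event time."""
    ordered = sorted(records)
    at_risk = len(ordered)
    survival = 1.0
    curve = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        events = 0
        leaving = 0
        # Group all records sharing the same observed time.
        while i < len(ordered) and ordered[i][0] == t:
            events += ordered[i][1]
            leaving += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing the estimate.
        at_risk -= leaving
    return curve


data = simulate_right_censored(500)
censoring_fraction = 1 - sum(event for _, event in data) / len(data)
km_curve = kaplan_meier(data)
```

Under these assumptions, comparing `km_curve` and `censoring_fraction` between a real and a synthetic dataset is one simple way to check whether both the event-time distribution and the censoring mechanism have been preserved.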