A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data

This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research because they allow new and existing model classes to be compared scientifically through proper empirical evaluation. Existing benchmarks in the survival literature are smaller in scale, both in the number of datasets used and in the extent of empirical evaluation, and often lack appropriate tuning or evaluation procedures; other comparison studies offer qualitative reviews rather than quantitative comparisons. This study aims to fill that gap by neutrally evaluating a broad range of methods and providing generalizable guidelines for practitioners. We benchmark 19 models, ranging from classical statistical approaches to many common machine learning methods, on 34 publicly available datasets. Models are tuned using both a discrimination measure (Harrell's C-index) and a scoring rule (Integrated Survival Brier Score), and are evaluated across six metrics covering discrimination, calibration, and overall predictive performance. Individual learners such as oblique random survival forests and likelihood-based boosting achieve better average ranks in overall predictive performance, and several boosting- and tree-based methods as well as parametric survival models achieve better discrimination rankings; nevertheless, no method significantly outperforms the commonly used Cox proportional hazards model under either tuning measure. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox proportional hazards model remains a simple and robust method, sufficient for most practitioners. All code, data, and results are publicly available on GitHub: https://github.com/slds-lmu/paper_2023_survival_benchmark
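
For readers unfamiliar with the discrimination measure named above, the following is a minimal sketch of how Harrell's C-index can be computed on right-censored data. It is an illustration only, not the implementation used in the benchmark; the function name `harrell_c_index`, its argument names, and the choice to skip pairs with tied observed times are assumptions made for this example.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Illustrative Harrell's C-index for right-censored data.

    times  : observed times (event or censoring time)
    events : 1/True if the event was observed, 0/False if censored
    risks  : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            # Simplifying assumption of this sketch: skip tied times.
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            # The earlier observation is censored, so the true
            # ordering of the two event times is unknown.
            continue
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four subjects, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_risks)
```

In the toy data, five pairs are comparable and four of them are ordered correctly by the risk scores, so `toy_c` is 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect discrimination.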
