Evaluation in survival analysis is plagued by the persistent use of metrics that are misaligned with the stated modeling objective. In addition, many evaluations rest on censoring assumptions that are left implicit or unjustified. As a result, reported performance can be misleading and may fail to answer the scientific or modeling question the evaluation was intended to address. In this position paper, we critically examine evaluation practices in survival analysis and highlight how censoring makes evaluation fundamentally different from standard regression or classification. We focus in particular on concordance-based measures, such as the C-index, which we show are heavily overused in the literature. To help identify appropriate metrics, we propose a set of key desiderata and introduce a double-helix ladder, in which valid evaluation requires alignment between metric and modeling assumptions. Through controlled experiments, we show that violations of this alignment can lead to misleading model comparisons. We conclude with practical guidance on how to evaluate a survival model.