Comparing estimators of discriminative performance of time-to-event models

Predicting the occurrence and timing of events is a major focus of data science applications, especially in biomedical research. The performance of models for these outcomes, often referred to as time-to-event or survival outcomes, is frequently summarized using measures of discrimination, in particular time-dependent AUC and concordance. Many estimators of these quantities have been proposed, and they can be broadly categorized as either semi-parametric or non-parametric. In this paper, we review the mathematical construction of various estimators and compare the behavior of the two classes. Importantly, we identify a previously unknown feature of the class of semi-parametric estimators that can result in vastly over-optimistic out-of-sample estimates of discriminative performance in common applied tasks. Although these semi-parametric estimators are popular in practice, the phenomenon we identify suggests that this class may be inappropriate for model assessment and selection based on out-of-sample evaluation criteria (e.g., cross-validation), because the estimators are biased in favor of overfit models. Non-parametric estimators do not exhibit this behavior but are highly variable for local discrimination. We propose to address this high variability through penalized regression spline smoothing. The behavior of various estimators of time-dependent AUC and concordance is illustrated via a simulation study using two different mechanisms that produce over-optimistic out-of-sample estimates with semi-parametric estimators. The estimators are further compared in a case study using data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014.
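
To make the notion of a non-parametric concordance estimator concrete, the following is a minimal sketch of Harrell's C-index for right-censored data. The abstract does not say which estimators are compared, so the choice of Harrell's C is an assumption made for illustration only, as are the function name `harrell_c` and its arguments. The estimate depends only on the ordering of observed times and predicted risk scores, and involves no model-based survival probabilities.

```python
def harrell_c(times, events, risk_scores):
    """Illustrative Harrell's C-index (hypothetical helper, not the paper's code).

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only usable if the earlier time is an observed event
        for j in range(n):
            if times[j] > times[i]:  # subject j was still event-free when i failed
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event has the higher risk score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, made up for illustration only.
toy_times = [5, 3, 8, 2]
toy_events = [1, 0, 1, 1]
toy_risk = [0.2, 0.9, 0.1, 0.5]
toy_c = harrell_c(toy_times, toy_events, toy_risk)
```

Because only comparable pairs contribute, heavy censoring reduces the number of usable pairs, which is one intuitive source of variability in estimates of this kind.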
