Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner amenable to both model interpretation and theoretical analysis. Specifically, the training data are partitioned into clusters using kernel netting, a recently developed training set compression scheme for classification and regression that we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is optimal up to a log factor. Whereas scalability at test time is achieved using the kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and by a heuristic approach to accelerating neural architecture search. On four standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive with the baselines tested in terms of time-dependent concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
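To make the role of the kernel concrete, the following is a minimal sketch of a classical kernel-weighted Kaplan-Meier estimator (sometimes called the conditional Kaplan-Meier or Beran estimator). It is not the survival kernet itself: the kernel here is a fixed Gaussian rather than one learned by a neural network, no kernel netting compression is applied, and the function names, the `bandwidth` parameter, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(x, X, bandwidth=1.0):
    """Similarity between one query point x and every row of X (hypothetical helper)."""
    sq_dists = np.sum((X - x) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def kernel_survival_curve(x, X_train, times, events, bandwidth=1.0):
    """Kernel-weighted Kaplan-Meier estimate of S(t | x).

    times  : observed time for each training point (event or censoring)
    events : 1 if the event was observed, 0 if the point was censored
    Returns the sorted unique event times and the estimated survival
    probability just after each of them.
    """
    weights = gaussian_kernel(x, X_train, bandwidth)
    event_times = np.unique(times[events == 1])
    survival = np.empty(len(event_times))
    s = 1.0
    for j, t in enumerate(event_times):
        # Total similarity mass still at risk at time t.
        at_risk = weights[times >= t].sum()
        # Similarity mass of points whose event occurs exactly at time t.
        died = weights[(times == t) & (events == 1)].sum()
        if at_risk > 0:
            s *= 1.0 - died / at_risk
        survival[j] = s
    return event_times, survival


# Toy data (made up for illustration): two groups of points in 1-D.
X_train = np.array([[0.0], [0.1], [0.2], [3.0], [3.1]])
times = np.array([2.0, 3.0, 4.0, 10.0, 12.0])
events = np.array([1, 1, 0, 1, 1])

# Survival curve for a query point close to the first group.
grid, surv = kernel_survival_curve(
    np.array([0.0]), X_train, times, events, bandwidth=0.5
)

# With a very wide kernel all weights are nearly equal, so the estimate
# reduces to the ordinary (unconditional) Kaplan-Meier curve.
_, surv_wide = kernel_survival_curve(
    np.array([0.0]), X_train, times, events, bandwidth=1e6
)
```

In this sketch every training point contributes to every prediction, so test-time cost grows with the size of the training set. The abstract's compression step addresses exactly that cost by replacing individual training points with a smaller set of clusters.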