Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee

Source: arXiv. Accepted at the Journal of Machine Learning Research; compared to the previous arXiv version, this draft includes some minor clarifications and edits.

Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On four standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive compared to various baselines tested in terms of time-dependent concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
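
To make the idea of predicting from a weighted combination of clusters concrete, here is a minimal sketch, not the authors' implementation. It assumes each cluster stores event counts and at-risk counts on a shared time grid, that a test point weights clusters with a Gaussian kernel on raw features (the paper learns this similarity with a neural network), and that the survival curve comes from a kernel-weighted Kaplan-Meier-style product. The names `cluster_summaries`, `predict_survival`, `exemplars`, and `bandwidth` are hypothetical and chosen only for illustration.

```python
import numpy as np


def cluster_summaries(assign, times, events, n_clusters):
    """Per-cluster event and at-risk counts on the grid of unique event times."""
    grid = np.unique(times[events == 1])
    deaths = np.zeros((n_clusters, grid.size))
    at_risk = np.zeros((n_clusters, grid.size))
    for c in range(n_clusters):
        t_c = times[assign == c]
        e_c = events[assign == c]
        for j, t in enumerate(grid):
            deaths[c, j] = np.sum((t_c == t) & (e_c == 1))  # events exactly at t
            at_risk[c, j] = np.sum(t_c >= t)                # still under observation at t
    return grid, deaths, at_risk


def predict_survival(x, exemplars, deaths, at_risk, bandwidth=1.0):
    """Survival curve on the time grid for one test point x.

    Assumption: Gaussian kernel between x and one exemplar per cluster.
    """
    sq_dist = np.sum((exemplars - x) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * bandwidth ** 2))  # one weight per cluster
    d = w @ deaths                                 # weighted event counts
    n = w @ at_risk                                # weighted at-risk counts
    hazard = np.divide(d, n, out=np.zeros_like(d), where=n > 0)
    return np.cumprod(1.0 - hazard)                # Kaplan-Meier-style product


# Toy data: four subjects in a single cluster; the second one is censored.
times = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([1, 0, 1, 1])
assign = np.zeros(4, dtype=int)
exemplars = np.array([[0.0, 0.0]])

grid, deaths, at_risk = cluster_summaries(assign, times, events, n_clusters=1)
surv = predict_survival(np.array([0.5, -0.5]), exemplars, deaths, at_risk)
```

With a single cluster the kernel weight cancels in the ratio, so the sketch reduces to the ordinary Kaplan-Meier estimate, which is a convenient sanity check. Because only the per-cluster summaries are needed at prediction time, the cost depends on the number of clusters rather than on the number of training points.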
