Survival analysis, or time-to-event analysis, is an important and widespread problem in healthcare research. Medical research has traditionally relied on Cox models for survival analysis because of their simplicity and interpretability. Cox models assume a log-linear hazard function and proportional hazards over time, and they can perform poorly when these assumptions fail. Newer survival models based on machine learning avoid these assumptions and offer improved accuracy, but sometimes at the expense of model interpretability, which is vital for clinical use. We propose a novel survival analysis pipeline that is both interpretable and competitive with state-of-the-art survival models. Specifically, we use an improved version of survival stacking to transform a survival analysis problem into a classification problem, ControlBurn to perform feature selection, and Explainable Boosting Machines to generate interpretable predictions. To evaluate our pipeline, we predict the risk of heart failure using a large-scale electronic health record (EHR) database. Our pipeline achieves state-of-the-art performance and provides novel insights about risk factors for heart failure.
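To make the survival-stacking step concrete, the following is a minimal sketch of the basic idea of recasting survival data as a binary classification dataset. It is not the improved variant used in the pipeline, whose details are not given here. The function name `stack_survival`, its arguments, and the toy data are all hypothetical and chosen only for illustration. For every distinct event time, each subject still at risk contributes one row, labeled 1 if that subject's event occurs at that time and 0 otherwise.

```python
from typing import List, Sequence, Tuple


def stack_survival(
    times: Sequence[float],
    events: Sequence[int],
    features: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[int]]:
    """Build a stacked (person-period) dataset from survival data.

    times[i]    : observed time for subject i (event or censoring time)
    events[i]   : 1 if the event was observed, 0 if the subject was censored
    features[i] : covariates for subject i

    Returns (rows, labels), where each row is the subject's covariates
    followed by the event time that defines the risk set.
    """
    # Only times at which an event was actually observed define risk sets.
    event_times = sorted({t for t, e in zip(times, events) if e == 1})

    rows: List[List[float]] = []
    labels: List[int] = []
    for t in event_times:
        for t_i, e_i, x_i in zip(times, events, features):
            if t_i >= t:  # subject i is still at risk at time t
                rows.append(list(x_i) + [t])
                # Positive label only for a subject whose event occurs at t.
                labels.append(1 if (e_i == 1 and t_i == t) else 0)
    return rows, labels


# Hypothetical toy data: one covariate (age), four subjects, one censored.
times = [2.0, 3.0, 3.0, 5.0]
events = [1, 0, 1, 1]
features = [[63.0], [51.0], [70.0], [45.0]]

rows, labels = stack_survival(times, events, features)
```

Any binary classifier fit on `rows` and `labels` then estimates a discrete-time hazard: the probability of the event at a given time, conditional on having survived up to it. In a pipeline like the one described, this classifier would be an Explainable Boosting Machine. One known cost of this construction is that the stacked dataset can grow to roughly the number of subjects times the number of distinct event times.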