Survival analysis concerns the task of predicting the time until an event occurs. Widely used in medicine, it must handle incomplete (i.e., censored) data, for instance from patients who did not experience the event during the study. For practical use, both accuracy and interpretability are important. Survival trees are easy-to-follow survival models that recursively split the patient cohort into discrete patient groups. Whilst survival trees can capture complex relationships, they typically need to grow large to do so, which threatens interpretability. Moreover, survival trees are often built using greedy approaches that may overlook globally optimal split combinations, limiting predictive performance. Shallow survival trees, in turn, require expressive, higher-order feature combinations to achieve competitive accuracy. We therefore use multi-objective genetic programming to evolve inherently inspectable feature sets and study how they interact with different tree induction strategies. We further introduce an evolutionary approach that jointly optimises the survival tree structure and the non-linear split logic. Our findings demonstrate that evolutionary feature construction improves predictive performance across different tree induction strategies on two real-world datasets and at two survival tree depths. Given its speed and flexible presentation, the multi-objective evolution of entire trees likely holds the most promise for future work.
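To make the notions of censoring and of a tree splitting a cohort into groups concrete, the following is a minimal sketch in plain Python. It is not the method studied here: it uses the standard Kaplan-Meier estimator and a single hand-picked split, with no evolution involved. The toy data, the `age` feature, and the threshold of 60 are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate: (time, survival probability) at each event time.

    events[i] is True if the event was observed at times[i] and False if the
    patient was censored at times[i] (e.g. the study ended first).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients that share the same time stamp.
        while i < len(data) and data[i][0] == t:
            n_events += int(data[i][1])
            n_leaving += 1
            i += 1
        # Only observed events change the survival estimate ...
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # ... but censored patients still leave the risk set afterwards.
        n_at_risk -= n_leaving
    return curve


# Hypothetical toy cohort (invented for illustration, not from any dataset).
times = [2, 3, 4, 5, 6, 8]
events = [True, False, True, True, False, True]
age = [70, 45, 68, 50, 40, 72]

# A depth-1 "survival tree": one split on a hypothetical threshold, with a
# separate survival curve estimated in each resulting leaf.
older_idx = [i for i, a in enumerate(age) if a >= 60]
younger_idx = [i for i, a in enumerate(age) if a < 60]

whole = kaplan_meier(times, events)
older = kaplan_meier([times[i] for i in older_idx], [events[i] for i in older_idx])
younger = kaplan_meier([times[i] for i in younger_idx], [events[i] for i in younger_idx])

for name, curve in [("all", whole), ("age >= 60", older), ("age < 60", younger)]:
    print(name, [(t, round(s, 3)) for t, s in curve])
```

In this sketch a censored patient never lowers the survival estimate directly but does shrink the risk set, which is how incomplete follow-up is still used. A survival tree applies the same idea recursively, so each leaf ends up with its own survival curve.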