Penalized Deep Partially Linear Cox Models with Application to CT Scans of Lung Cancer Patients

Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective patient-centered therapies. The National Lung Screening Trial (NLST) employed computed tomography texture analysis, which provides objective measurements of texture patterns on CT scans, to quantify the mortality risks of lung cancer patients. Partially linear Cox models have gained popularity for survival analysis because they split the hazard function into parametric and nonparametric components, allowing both well-established risk factors (such as age and clinical variables) and emerging risk factors (e.g., image features) to be incorporated within a unified framework. However, when the dimension of the parametric component exceeds the sample size, model fitting becomes difficult, while nonparametric modeling suffers from the curse of dimensionality. We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model. We prove the convergence and asymptotic properties of the estimator and compare it to other methods through extensive simulation studies, evaluating its performance in risk prediction and feature selection. The proposed method is applied to the NLST dataset to uncover the effects of key clinical and imaging risk factors on patients' survival. Our findings provide valuable insights into the relationship between these factors and survival outcomes.
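
To make the penalized objective concrete, here is a minimal sketch, not the authors' implementation. It assumes the standard partially linear Cox form, hazard(t | x, z) = hazard_0(t) * exp(x'beta + g(z)), and the SCAD penalty as defined by Fan and Li (2001) with their suggested default a = 3.7. The function names, the argument `g_z` (a stand-in for the output of a deep network evaluated at the nonparametric covariates), the scaling by the sample size, and the no-ties assumption are all illustrative choices that do not come from the abstract.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """Elementwise SCAD penalty (Fan and Li, 2001); requires a > 2 and lam > 0."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b                                            # |b| <= lam
    quad = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))    # lam < |b| <= a*lam
    const = lam**2 * (a + 1) / 2                                # |b| > a*lam
    return np.where(b <= lam, linear, np.where(b <= a * lam, quad, const))

def neg_log_partial_likelihood(eta, time, event):
    """Negative log partial likelihood scaled by n (assumes no tied event times)."""
    eta = np.asarray(eta, dtype=float)
    order = np.argsort(-np.asarray(time, dtype=float))  # descending follow-up time
    eta = eta[order]
    event = np.asarray(event, dtype=float)[order]
    # Running log-sum-exp gives log of the summed risk scores over each risk set.
    log_risk = np.logaddexp.accumulate(eta)
    return -np.sum((eta - log_risk) * event) / len(eta)

def penalized_objective(beta, g_z, x, time, event, lam, a=3.7):
    """SCAD-penalized objective; g_z is a placeholder for the network output g(z)."""
    eta = np.asarray(x, dtype=float) @ np.asarray(beta, dtype=float) + g_z
    return neg_log_partial_likelihood(eta, time, event) + scad_penalty(beta, lam, a).sum()
```

The SCAD penalty grows linearly near zero, like the lasso, and then flattens to a constant, so large coefficients are not shrunk further. In a full procedure the parameters of `g` would be trained together with `beta`, for example by gradient-based optimization; this sketch only evaluates the objective for fixed values.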
