We propose a constrained maximum partial likelihood estimator for dimension reduction in integrative (e.g., pan-cancer) survival analysis with high-dimensional covariates. We assume that for each population in the study, the hazard function follows a distinct Cox proportional hazards model. To borrow information across populations, we assume that all of the hazard functions depend only on a small number of linear combinations of the predictors. We estimate these linear combinations using an algorithm based on "distance-to-set" penalties, which allows us to impose both low-rankness and sparsity. We derive asymptotic results which reveal that our regression coefficient estimator is more efficient than the estimator obtained by fitting a separate proportional hazards model for each population. Numerical experiments suggest that our method outperforms related competitors under various data generating models. We use our method to perform a pan-cancer survival analysis relating protein expression to survival across 18 distinct cancer types. Our approach identifies six linear combinations, depending on only 20 proteins, which explain survival across the cancer types. Finally, we validate our fitted model on four external datasets and show that our estimated coefficients can lead to better prediction than those of popular competitors.
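To make the structural assumption concrete, the display below sketches one way to write the model. The notation (J populations, p predictors, rank r, row-sparsity level s, penalty weight ρ, constraint set C) is assumed here for illustration and is not taken from the abstract. The last line shows only the generic form of a distance-to-set penalty; how the two constraints are combined in the actual algorithm is not stated above.

```latex
% Illustrative notation only; all symbols are assumptions, not the paper's.
% Population-specific Cox proportional hazards models, j = 1, ..., J:
\lambda_j(t \mid x) = \lambda_{0j}(t)\,\exp\!\left(x^\top \beta_j\right),
\qquad x \in \mathbb{R}^p .

% Stack the coefficients and impose shared low-rank, row-sparse structure:
B = [\beta_1, \dots, \beta_J] \in \mathbb{R}^{p \times J},
\qquad \operatorname{rank}(B) \le r,
\qquad \#\{k : B_{k\cdot} \neq 0\} \le s .

% Equivalently, every hazard depends on x only through r linear combinations:
B = A\,\Gamma, \quad A \in \mathbb{R}^{p \times r},\;
\Gamma = [\gamma_1, \dots, \gamma_J] \in \mathbb{R}^{r \times J},
\qquad x^\top \beta_j = (A^\top x)^\top \gamma_j .

% Generic distance-to-set penalized objective, with \ell the sum of the
% population-specific log partial likelihoods and \mathcal{C} a constraint set:
\min_{B}\; -\ell(B) + \frac{\rho}{2}\,\operatorname{dist}(B, \mathcal{C})^2,
\qquad \operatorname{dist}(B, \mathcal{C}) = \min_{C \in \mathcal{C}} \lVert B - C \rVert_F .
```

Under this notation, the pan-cancer application described above would correspond to J = 18, r = 6, and s = 20.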