Motivation: Identification of genomic, molecular and clinical markers prognostic of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patient outcomes for survival prognosis. However, the massive sizes of the omics data sets, along with their correlation structures, pose challenges for studying relationships between the molecular information and patients' survival outcomes. Results: We present a general workflow for survival analysis that accepts high-dimensional omics data as input for identifying survival-associated features and validating survival models. In particular, we focus on the commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, which are especially useful for high-dimensional data, although the framework is applicable more generally. Availability and implementation: A step-by-step R tutorial that uses The Cancer Genome Atlas survival and omics data to execute and evaluate survival models is available at https://ocbe-uio.github.io/survomics/survomics.html.