Identifying genomic, molecular and clinical markers that predict patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics (e.g. DNA methylation), transcriptomics, proteomics and metabolomics, and to study how these potential risk factors complement the clinical characterization of patients for survival prognosis. However, the massive size of omics data sets poses challenges for studying the relationships between molecular information and patients' survival outcomes. We present a general workflow for survival analysis, with emphasis on handling high-dimensional omics data as inputs when identifying survival-associated omics features and validating survival models. In particular, we focus on commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, but the framework and pipeline are applicable more generally. When multi-omics data are available for survival modelling, extra caution is needed to account for the underlying structure both within and between the omics data sets and features. A step-by-step R tutorial that uses The Cancer Genome Atlas survival and omics data to execute and evaluate survival models is available at \url{https://ocbe-uio.github.io/survomics/survomics.html}.