Cancer prognosis is often based on a set of omics covariates and a set of established clinical covariates such as age and tumor stage. Combining these two sets poses three challenges. First, dimension difference: clinical covariates should be favored because they are low-dimensional and usually have stronger prognostic ability than high-dimensional omics covariates. Second, interactions: genetic profiles and their prognostic effects may vary across patient subpopulations. Third, redundancy: a gene or set of genes may encode prognostic information similar to that of a clinical covariate. To address these challenges, we combine regression trees, which employ clinical covariates only, with a fusion-like penalized regression framework in the leaf nodes for the omics covariates. The fusion penalty controls the variability in genetic profiles across subpopulations. We prove that the shrinkage limit of the proposed method equals a benchmark model: a ridge regression with penalized omics covariates and unpenalized clinical covariates. Furthermore, the proposed method allows researchers to evaluate, for different subpopulations, whether the overall omics effect improves prognosis compared with employing clinical covariates only. In an application to colorectal cancer prognosis based on established clinical covariates and more than 20,000 gene expression covariates, we illustrate the features of our method.
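As a rough illustration of the benchmark model named in the abstract (a ridge regression that penalizes the omics covariates but leaves the clinical covariates unpenalized), the following minimal sketch may help. It assumes a continuous outcome and squared-error loss, which is a simplification, since prognosis outcomes are often censored survival times. The function name `ridge_unpenalized_clinical`, the variable names, and the simulated data are hypothetical and not taken from the authors' software.

```python
import numpy as np


def ridge_unpenalized_clinical(Z, X, y, lam):
    """Ridge regression with an L2 penalty on the omics block only.

    Z : (n, q) clinical covariates, unpenalized (hypothetical name)
    X : (n, p) omics covariates, penalized (hypothetical name)
    Minimizes ||y - a0 - Z a - X b||^2 + lam * ||b||^2 over (a0, a, b).
    Assumes lam > 0 and that [1, Z] has full column rank.
    """
    n, q = Z.shape
    p = X.shape[1]
    # Design matrix: intercept column, clinical block, omics block.
    D = np.hstack([np.ones((n, 1)), Z, X])
    # Diagonal penalty: zero for intercept and clinical, lam for omics.
    P = np.diag(np.concatenate([np.zeros(1 + q), np.full(p, lam)]))
    # Penalized normal equations (fine for a small p; a sketch only).
    coef = np.linalg.solve(D.T @ D + P, D.T @ y)
    return coef[0], coef[1:1 + q], coef[1 + q:]


# Simulated toy data, purely illustrative: p > n as in omics settings.
rng = np.random.default_rng(0)
n, q, p = 50, 2, 200
Z = rng.normal(size=(n, q))
X = rng.normal(size=(n, p))
y = 1.0 + Z @ np.array([2.0, -1.0]) + 0.1 * X[:, 0] + rng.normal(size=n)

intercept, clinical_coef, omics_coef = ridge_unpenalized_clinical(Z, X, y, lam=50.0)
print("clinical coefficients:", clinical_coef)
print("norm of omics coefficients:", np.linalg.norm(omics_coef))
```

Because only the omics block is shrunk, increasing `lam` pushes the omics coefficients toward zero while the clinical coefficients approach their ordinary least-squares values, which reflects the idea that the low-dimensional clinical covariates should be favored.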