Statistical approaches that successfully combine multiple datasets are more powerful, efficient, and scientifically informative than separate analyses. To model the structure of variation accurately and comprehensively in high-dimensional data across multiple sample sets (i.e., cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method that concurrently learns both covariate-driven and auxiliary structured variation. We consider a structured nuclear norm objective, motivated by random matrix theory, in which each regression or factorization term may be shared across, or specific to, any number of cohorts. Our framework subsumes several existing methods, such as reduced rank regression and unsupervised multi-matrix factorization approaches, and includes as a special case a promising novel approach to regression and factorization of a single dataset (aRRR). Simulations demonstrate substantial gains in power from combining multiple datasets and from parsimoniously accounting for all structured variation. We apply maRRR to gene expression data from multiple cancer types (i.e., pan-cancer) from TCGA, with somatic mutations as covariates. The method performs well with respect to prediction and imputation of held-out data, and provides new insights into mutation-driven and auxiliary variation that is shared across or specific to certain cancer types.
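To make the idea of a nuclear norm objective with a regression term and a factorization term concrete, the following is a minimal single-cohort sketch. It is not the authors' implementation. It assumes a model of the form X ≈ BY + S, with X a features-by-samples outcome matrix, Y a covariates-by-samples matrix, B a low-rank coefficient matrix, and S low-rank auxiliary structure, and it assumes Y has orthonormal rows so that both block updates reduce to singular value soft-thresholding. The names `svt`, `objective`, `fit_augmented`, `lam_B`, and `lam_S` are hypothetical and chosen for illustration only.

```python
import numpy as np


def svt(M, lam):
    """Singular value soft-thresholding.

    Returns the minimizer over S of 0.5 * ||M - S||_F^2 + lam * ||S||_*.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # shrink each singular value toward zero
    return (U * s_shrunk) @ Vt


def objective(X, Y, B, S, lam_B, lam_S):
    """Penalized least squares loss with nuclear norm penalties on B and S."""
    resid = X - B @ Y - S
    return (0.5 * np.sum(resid ** 2)
            + lam_B * np.linalg.norm(B, "nuc")
            + lam_S * np.linalg.norm(S, "nuc"))


def fit_augmented(X, Y, lam_B, lam_S, n_iter=200):
    """Alternate exact block updates for B and S.

    Assumption (a simplification made for this sketch): Y has orthonormal
    rows, i.e. Y @ Y.T is the identity. Under that assumption the B-update
    is soft-thresholding of (X - S) @ Y.T.
    """
    p, n = X.shape
    q = Y.shape[0]
    B = np.zeros((p, q))
    S = np.zeros((p, n))
    for _ in range(n_iter):
        B = svt((X - S) @ Y.T, lam_B)  # covariate-driven low-rank term
        S = svt(X - B @ Y, lam_S)      # auxiliary low-rank term
    return B, S


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((30, 3)))
    Y = Q.T  # 3 covariates, 30 samples, orthonormal rows
    X = rng.standard_normal((8, 3)) @ Y + 0.1 * rng.standard_normal((8, 30))
    B, S = fit_augmented(X, Y, lam_B=0.5, lam_S=0.5)
    print(np.linalg.matrix_rank(B), np.linalg.matrix_rank(S))
```

In the multi-cohort setting described in the abstract, each regression or factorization term would apply only to the columns (samples) of the cohorts it is assigned to. This sketch omits that bookkeeping, as well as the choice of penalty weights, which the abstract says is motivated by random matrix theory.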