Canonical correlation analysis (CCA) is a widely used technique for estimating associations between two sets of multi-dimensional variables. Recent advances in CCA methods have extended its use to deciphering interactions in multiomics datasets, imaging-omics datasets, and more. However, conventional CCA methods are limited in their ability to incorporate structured patterns in the cross-correlation matrix, potentially leading to suboptimal estimates. To address this limitation, we propose the graph Canonical Correlation Analysis (gCCA) approach, which calculates canonical correlations based on the graph structure of the cross-correlation matrix between the two sets of variables. We develop computationally efficient algorithms for gCCA and provide theoretical results for the finite-sample analysis of best subset selection and canonical correlation estimation by introducing concentration inequalities and a stopping-time rule based on martingale theory. Extensive simulations demonstrate that gCCA outperforms competing CCA methods. Additionally, we apply gCCA to a multiomics dataset of DNA methylation and RNA-seq transcriptomics, identifying gene expression pathways that are positively or negatively regulated by DNA methylation pathways.
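For orientation, the conventional CCA baseline that the abstract contrasts with gCCA can be sketched in a few lines. This is not the gCCA algorithm, which the abstract does not specify; it is only the textbook computation of canonical correlations as the singular values of the whitened cross-covariance matrix. The function name `classical_cca` and the small ridge term `reg` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def classical_cca(X, Y, reg=1e-8):
    """Canonical correlations of X (n x p) and Y (n x q), largest first.

    Textbook CCA, shown only as a baseline; `reg` is a hypothetical
    ridge term added for numerical stability.
    """
    # Center each variable set.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance and cross-covariance matrices.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    # Whiten with Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)

    # K = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    K = np.linalg.solve(Lx, Sxy)
    K = np.linalg.solve(Ly, K.T).T
    return np.linalg.svd(K, compute_uv=False)


if __name__ == "__main__":
    # Toy data: the first columns of X and Y share one latent signal.
    rng = np.random.default_rng(0)
    z = rng.normal(size=(200, 1))
    X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 3))])
    Y = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 2))])
    print(classical_cca(X, Y))
```

This baseline treats the cross-covariance matrix as unstructured, which is the limitation the abstract says gCCA is designed to address by using the graph structure of the cross-correlation matrix.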