We propose Cooperative Component Analysis (CoCA), a new method for unsupervised multi-view analysis that identifies the component that simultaneously captures significant within-view variance and exhibits strong cross-view correlation. Integrating multi-view data is particularly important in biology and medicine, where various types of "-omic" data, ranging from genomics to proteomics, are measured on the same set of samples, and the goal is to uncover important shared signals that represent underlying biological mechanisms. CoCA combines an approximation error loss, which preserves information within each data view, with an "agreement penalty," which encourages alignment across data views. By balancing the trade-off between these two terms in the objective, CoCA interpolates between the commonly used principal component analysis (PCA) and canonical correlation analysis (CCA), which arise as special cases at the two ends of the solution path. CoCA chooses the degree of agreement in a data-adaptive manner, using a validation set or cross-validation to estimate test error. Furthermore, we propose a sparse variant of CoCA that incorporates the Lasso penalty to yield feature sparsity, facilitating the identification of key features driving the observed patterns. We demonstrate the effectiveness of CoCA on simulated data and on two real multiomics studies, one of COVID-19 and one of ductal carcinoma in situ of the breast. In both real data applications, CoCA successfully integrates multiomics data, extracting components that are not only consistently present across different data views but also more informative and predictive of disease progression. CoCA offers a powerful framework for discovering important shared signals in multi-view data, with the potential to uncover novel insights as multi-view data become increasingly common.
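To make the two ends of the solution path concrete, the following is a minimal NumPy sketch of the two classical methods named above: the first principal component of a data matrix and the first pair of canonical directions for two views. It is not the CoCA algorithm, whose objective is not specified here. The function names `first_pc` and `first_cca`, the small `ridge` term added for numerical stability, and the choice to apply PCA to the column-concatenated views in the usage note are all assumptions made for illustration.

```python
import numpy as np


def first_pc(X):
    """Return the unit-norm loading vector of the first principal component."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]                                 # direction of maximal variance


def first_cca(X1, X2, ridge=1e-8):
    """Return (a, b, rho): first canonical directions and their correlation.

    `ridge` is a hypothetical stabilizer added to the covariance diagonals;
    it is not part of the textbook definition of CCA.
    """
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1c.T @ X1c / (n - 1) + ridge * np.eye(X1.shape[1])
    S22 = X2c.T @ X2c / (n - 1) + ridge * np.eye(X2.shape[1])
    S12 = X1c.T @ X2c / (n - 1)
    # Whiten each view with a Cholesky factor: S = L L^T.
    L1 = np.linalg.cholesky(S11)
    L2 = np.linalg.cholesky(S22)
    A = np.linalg.solve(L1, S12)                 # L1^{-1} S12
    M = np.linalg.solve(L2, A.T).T               # L1^{-1} S12 L2^{-T}
    U, s, Vt = np.linalg.svd(M)
    a = np.linalg.solve(L1.T, U[:, 0])           # map back to original coordinates
    b = np.linalg.solve(L2.T, Vt[0])
    return a, b, s[0]
```

For two views `X1` and `X2` measured on the same samples, `first_pc(np.hstack([X1, X2]))` gives a variance-driven direction, while `first_cca(X1, X2)` gives directions chosen purely for cross-view correlation; a method that trades off the two criteria would be expected to fall between these extremes.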