Datasets in which measurements of two (or more) types are obtained from a common set of samples arise in many scientific applications. A common problem in the exploratory analysis of such data is to identify groups of features of different data types that are strongly associated. A bimodule is a pair (A,B) of feature sets, one from each of two data types, such that the aggregate cross-correlation between the features in A and those in B is large. A bimodule (A,B) is stable if A coincides with the set of features that have significant aggregate correlation with the features in B, and vice versa. This paper proposes an iterative-testing-based bimodule search procedure (BSP) to identify stable bimodules. Compared to existing methods for detecting cross-correlated features, BSP was best at recovering true bimodules with sufficient signal while limiting false discoveries. In addition, we applied BSP to the problem of expression quantitative trait loci (eQTL) analysis using data from the GTEx consortium. BSP identified several thousand SNP-gene bimodules. While many of the individual SNP-gene pairs appearing in the discovered bimodules were identified by standard eQTL methods, the discovered bimodules revealed genomic subnetworks that appeared to be biologically meaningful and worthy of further scientific investigation.
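The stability condition can be read as a fixed point: updating A from B and B from A leaves both sets unchanged. The following is a minimal sketch of that idea, not the BSP procedure itself. The abstract does not specify the aggregate statistic or the significance test, so the sketch assumes mean squared sample correlation as the aggregate measure and a fixed cutoff (`threshold`) in place of the iterative testing step. All function and parameter names (`cross_corr`, `update`, `find_stable_pair`, `threshold`, `max_iter`) are hypothetical.

```python
import numpy as np


def cross_corr(X, Y):
    """Sample correlation matrix between the columns of X (n x p) and Y (n x q)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]


def update(R, idx, threshold):
    """Indices of rows of R whose mean squared correlation with columns idx exceeds threshold.

    Assumption: a fixed cutoff stands in for a formal significance test.
    """
    if len(idx) == 0:
        return np.array([], dtype=int)
    score = (R[:, idx] ** 2).sum(axis=1)
    return np.flatnonzero(score > threshold * len(idx))


def find_stable_pair(X, Y, seed_A, threshold=0.25, max_iter=50):
    """Alternate the A and B updates from a seed set until A stops changing.

    Returns (A, B) at a fixed point, or None if no fixed point is reached
    within max_iter iterations. The returned sets may be empty.
    """
    R = cross_corr(X, Y)
    A = np.asarray(seed_A, dtype=int)
    for _ in range(max_iter):
        B = update(R.T, A, threshold)      # features of Y associated with A
        A_new = update(R, B, threshold)    # features of X associated with B
        if np.array_equal(A_new, A):
            return A, B
        A = A_new
    return None


# Usage on synthetic data: three X features and two Y features share a latent signal.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 1))
X = rng.normal(size=(n, 6))
Y = rng.normal(size=(n, 5))
X[:, :3] = z + 0.5 * rng.normal(size=(n, 3))
Y[:, :2] = z + 0.5 * rng.normal(size=(n, 2))
result = find_stable_pair(X, Y, seed_A=[0])
print(result)
```

A fixed cutoff keeps the sketch short but provides no control over false discoveries; a testing-based update, as the abstract describes, would replace the comparison inside `update`.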