While single-cell RNA-seq enables investigation of the effect of cell type on the transcriptome, the pure tissue environmental effect has not been well investigated. Because tissues and cell types co-occur in biased combinations in the body, it has been difficult to evaluate the effect of the pure tissue environment through omics data mining. When evaluating the effects of discrete variables such as cell type, tissue, and other categorical variables, it is important to prevent statistical confounding among them. We propose a novel method that enumerates suitable analysis units of variables for estimating the effects of tissue environment by extending the maximal biclique enumeration problem for bipartite graphs to $k$-partite hypergraphs. We applied the proposed method to Tabula Muris Senis, a large mouse single-cell transcriptome dataset, to evaluate pure tissue environmental effects on gene expression. Data mining with the proposed method revealed pure tissue environment effects on gene expression, and their age-related changes, among adipose sub-tissues. The proposed method supports the evaluation of the effects of discrete variables in exploratory data mining of large-scale genomics datasets.
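To make the idea of an "analysis unit" concrete, the following is a minimal brute-force sketch, not the authors' algorithm. It assumes that each observed combination of categorical values (here tissue, cell type, age) is treated as a hyperedge of a $k$-partite hypergraph, and that an analysis unit is a maximal choice of one non-empty value set per variable whose full Cartesian product is observed. The function names and the toy records are hypothetical and chosen only for illustration; the exhaustive search is exponential and suitable only for very small inputs.

```python
from itertools import chain, combinations, product


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    items = sorted(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    )


def maximal_k_partite_cliques(records):
    """Brute-force enumeration of maximal complete k-partite sub-structures.

    records: iterable of k-tuples, one value per categorical variable.
    Returns a list of k-tuples of frozensets (S_1, ..., S_k) such that every
    combination in S_1 x ... x S_k is observed and no S_i can be enlarged.
    """
    records = set(records)
    k = len(next(iter(records)))
    # Values observed for each of the k variables (the k vertex parts).
    parts = [{rec[i] for rec in records} for i in range(k)]

    # Keep every choice of subsets whose Cartesian product is fully observed.
    complete = []
    for choice in product(*(list(nonempty_subsets(p)) for p in parts)):
        if all(combo in records for combo in product(*choice)):
            complete.append(tuple(frozenset(s) for s in choice))

    def strictly_contained(a, b):
        return a != b and all(x <= y for x, y in zip(a, b))

    # A candidate is maximal if no other complete candidate contains it.
    return [
        c for c in complete
        if not any(strictly_contained(c, d) for d in complete)
    ]


# Hypothetical toy observations: (tissue, cell type, age). Not real data.
records = {
    ("fatA", "B cell", "3m"), ("fatA", "T cell", "3m"),
    ("fatB", "B cell", "3m"), ("fatB", "T cell", "3m"),
    ("fatA", "B cell", "24m"), ("fatB", "B cell", "24m"),
    ("lung", "T cell", "3m"),
}

cliques = maximal_k_partite_cliques(records)
for tissues, cell_types, ages in cliques:
    print(sorted(tissues), sorted(cell_types), sorted(ages))
```

Within any one returned unit, every tissue is observed with every cell type at every age, so a comparison across tissues inside that unit is not confounded by missing combinations of the other variables. A practical implementation would need a far more efficient enumeration strategy than this exhaustive search.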