Motivation: Networks underlie the generation and interpretation of many biological datasets: gene networks shed light on the regulatory structure of the genome, and cell networks can capture the structure of the tumor micro-environment. However, most methods that learn such networks make the faulty 'independence assumption': to learn the gene network, they assume that no cell network exists. 'Multi-axis' methods, which do not make this assumption, fail to scale beyond a few thousand cells or genes, which limits their applicability to only the smallest datasets.

Results: We develop a multi-axis method capable of processing million-cell datasets within minutes. This was previously impossible, and it unlocks the use of such methods on modern scRNA-seq datasets, as well as on more complex datasets. We show that our method yields novel biological insights from real single-cell data and compares favorably to the existing hdWGCNA methodology. In particular, it identifies long non-coding RNA genes that potentially have a regulatory or functional role in neuronal development.

Availability and implementation: Our methodology is available as a Python package, GmGM, on PyPI (https://pypi.org/project/GmGM/0.5.3/). The code for all experiments performed in this paper is available on GitHub (https://github.com/BaileyAndrew/GmGM-Bioinformatics).

Contact: [email protected]

Supplementary information: Our proofs and some additional experiments are available in the supplementary material.

Keywords: Gaussian graphical models, multi-axis models, transcriptomics, multi-omics, scalability