We propose a novel way of representing and analysing single-cell genomic count data: we model the observed count matrix as a network adjacency matrix, noting that both types of matrix show similar levels of sparsity. Because an adjacency matrix is equivalent to the network it represents, this perspective allows theory from stochastic network modelling to be applied in a principled way to single-cell genomic data. It provides new ways to view and analyse data of this type, and gives first-principles theoretical justification to established, successful methods. From this perspective, we show how understanding the Laplacian spectral embedding is key to both visualisation of, and unsupervised learning from, single-cell genomic count data. We demonstrate the success of this approach for visualisation and unsupervised learning of cellular identities in three cell-biological contexts from the epiblast/epithelial/neural lineage. New technology has made it possible to gather genomic data from single cells at unprecedented scale, which brings new challenges in dealing with much higher levels of heterogeneity between individual cells than expected. Novel, tailored computational-statistical methodology, as proposed in this paper, is crucial to deriving meaningful information from these new types of data, and it requires collaboration between mathematical and biomedical scientists.
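To make the central idea concrete, the following is a minimal sketch of a Laplacian spectral embedding applied to a count matrix treated as a bipartite (cells by genes) adjacency matrix. It is an illustration under stated assumptions, not the implementation used in this paper: the function name `laplacian_spectral_embedding`, the symmetric degree normalisation, the square-root scaling of the singular vectors, and the simulated Poisson toy data are all choices made here for the example.

```python
import numpy as np


def laplacian_spectral_embedding(counts, dim=2, eps=1e-12):
    """Embed the rows (cells) of a count matrix in `dim` dimensions.

    Assumption: the count matrix is read as the weighted biadjacency
    matrix of a cell-gene network and normalised by the square roots
    of the row and column degrees before a truncated SVD.
    """
    counts = np.asarray(counts, dtype=float)
    row_degree = counts.sum(axis=1)  # total counts per cell
    col_degree = counts.sum(axis=0)  # total counts per gene

    # Guard against division by zero for empty rows or columns.
    row_scale = 1.0 / np.sqrt(np.maximum(row_degree, eps))
    col_scale = 1.0 / np.sqrt(np.maximum(col_degree, eps))

    # Degree-normalised matrix D_r^{-1/2} X D_c^{-1/2}.
    normalised = counts * row_scale[:, None] * col_scale[None, :]

    # Leading singular vectors give the low-dimensional embedding.
    u, s, _ = np.linalg.svd(normalised, full_matrices=False)
    embedding = u[:, :dim] * np.sqrt(s[:dim])
    return embedding, s[:dim]


# Hypothetical toy data: two groups of 30 cells, each expressing
# mainly its own block of 20 genes, with sparse Poisson counts.
rng = np.random.default_rng(0)
rates = np.full((60, 40), 0.02)
rates[:30, :20] = 0.5
rates[30:, 20:] = 0.5
toy_counts = rng.poisson(rates)

embedding, singular_values = laplacian_spectral_embedding(toy_counts, dim=2)
print(embedding.shape)
print(singular_values)
```

With this normalisation the leading singular value equals 1 and its singular vector reflects only the degrees, so any group structure appears in the later dimensions. A clustering method or a scatter plot of the embedding coordinates could then be used for unsupervised learning or visualisation.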