Clustering single-cell RNA sequencing (scRNA-seq) datasets can give key insights into the biological functions of cells. It is therefore not surprising that network-based community detection methods (among the better-performing clustering methods) are increasingly used to cluster scRNA-seq datasets. The main challenge in applying these methods to scRNA-seq datasets is that they require the true number of communities, or blocks, \emph{a priori} in order to estimate community memberships. Although methods for estimating the number of communities exist, they are not suitable for noisy scRNA-seq datasets. Moreover, an appropriate method is needed for extracting suitable networks from scRNA-seq datasets. To address these issues, we present a two-fold solution: i) a simple likelihood-based approach for extracting stochastic block models (SBMs) from scRNA-seq datasets, and ii) a new sequential multiple testing (SMT) method for estimating the number of communities in SBMs. We study the theoretical properties of SMT and establish its consistency under moderate sparsity conditions. In addition, we compare the numerical performance of SMT with several existing methods. We also show that our approach performs competitively against existing methods for estimating the number of communities on benchmark scRNA-seq datasets. Finally, we use our approach to estimate subgroups in a human retina bipolar single-cell dataset.