We consider the problem of clustering nested (hierarchical) data, in which observations are grouped and variables are recorded at both the group level and the observation level. In our motivating OneK1K dataset, the data comprise single-cell RNA-sequencing (scRNA-seq) measurements of 1.27 million cells (observations) from 982 individuals (groups), along with individual-specific genotype data. Such data enable both the identification of cell types and the investigation of how genetic variation among individuals influences differences in cell-type profiles. Our goal, therefore, is to jointly cluster cells and individuals, capturing heterogeneity at both levels by using cell-specific gene expression together with individual-specific genotypes. However, existing grouped clustering methods do not incorporate group-level variables, which limits their ability to capture genotype heterogeneity in our motivating application. To address this, we propose the Nested Atoms Model (NAM), a new Bayesian nonparametric approach that performs the desired two-layered clustering while accounting for both group-level and observation-level variables. To scale NAM to high-dimensional data, we develop a fast variational Bayesian inference algorithm. Simulations show that NAM outperforms existing methods that ignore group-level variables. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles. The resulting cell clusters align with known immune cell types based on differential gene expression, underscoring the ability of NAM to capture nested heterogeneity and provide biologically meaningful insights.