Generalized Bayesian nonparametric clustering framework for high-dimensional spatial omics data

The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has transformed genomic research by enabling high-throughput gene expression profiling while preserving spatial context. Identifying spatial domains within SRT data is a critical task, and numerous computational approaches are currently available. However, most existing methods rely on a multi-stage process: ad hoc dimension reduction techniques are first applied to manage the high dimensionality of SRT data, and the resulting low-dimensional embeddings are then passed to model-based or distance-based clustering methods. Additionally, many approaches depend on arbitrarily specifying the number of clusters (i.e., spatial domains). Both practices can result in information loss and suboptimal downstream analysis. To address these limitations, we propose a novel Bayesian nonparametric mixture of factor analysis (BNPMFA) model, which incorporates a Markov random field-constrained Gibbs-type prior for partitioning high-dimensional spatial omics data. This new prior integrates the spatial constraints inherent in SRT data while simultaneously inferring cluster membership and determining the number of spatial domains. We establish the theoretical identifiability of cluster membership within this framework. The efficacy of the proposed approach is demonstrated through realistic simulations and applications to two SRT datasets. Our results show that the BNPMFA model not only surpasses state-of-the-art methods in clustering accuracy and in estimating the number of clusters but also offers novel insights for identifying cellular regions within tissue samples.

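
To make the idea of a Markov random field-constrained prior on partitions more concrete, the following is a minimal sketch and not the BNPMFA model itself. It assumes the simplest Gibbs-type prior (a Dirichlet process, i.e., a Chinese restaurant process) and multiplies its usual allocation weights by a Potts-style term that rewards agreement with neighboring spots. The function name `mrf_crp_conditional`, the parameters `alpha` (concentration) and `beta` (spatial smoothing strength), and the toy neighbor graph are all hypothetical choices made for illustration; the abstract does not specify the exact form of the prior.

```python
import numpy as np

def mrf_crp_conditional(i, labels, neighbors, alpha=1.0, beta=1.0):
    """Prior probabilities for the label of spot i given all other labels.

    Assumed (illustrative) form:
      existing cluster k: weight = n_k * exp(beta * #neighbors of i in k)
      new cluster:        weight = alpha
    where n_k counts the spots other than i currently in cluster k.
    """
    # Cluster sizes with spot i removed
    others = np.delete(labels, i)
    clusters, counts = np.unique(others, return_counts=True)

    # Number of spatial neighbors of spot i in each existing cluster
    nbr_labels = labels[neighbors[i]]
    agree = np.array([(nbr_labels == k).sum() for k in clusters])

    # Last entry is the weight for opening a new cluster
    weights = np.append(counts * np.exp(beta * agree), alpha)
    return clusters, weights / weights.sum()

# Toy example: five spots on a line, each adjacent to its immediate neighbors
labels = np.array([0, 0, 1, 1, 1])
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

clusters, probs = mrf_crp_conditional(1, labels, neighbors, alpha=1.0, beta=1.0)
for k, p in zip(list(clusters) + ["new"], probs):
    print(f"cluster {k}: {p:.3f}")
```

Setting `beta=0` recovers the ordinary Chinese restaurant process weights, so the number of clusters is still inferred rather than fixed in advance, while larger `beta` values favor spatially contiguous domains. In a full sampler these prior weights would be combined with a likelihood term, which in the abstract's setting would come from the mixture of factor analyzers.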