Clustering by Denoising: Latent plug-and-play diffusion for single-cell data

Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet accurate clustering, and with it any downstream analysis based on cell labels, remains challenging because of measurement noise and biological variability. In standard latent spaces (e.g., those obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation space from the denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while noise is reintroduced in the original high-dimensional observation space to steer this process. This "input-space steering" keeps the denoising trajectory faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling, via a tunable balance between the prior and the observed data; (2) uncertainty quantification, providing principled uncertainty estimates for downstream analysis; and (3) generalizable denoising, which leverages clean reference data to denoise noisier datasets and, via averaging, improves quality beyond that of the training set. We evaluate robustness on both synthetic and real single-cell genomics data. On synthetic data, our method improves clustering accuracy across varied noise levels and dataset shifts. On real-world single-cell data, it yields cell clusters with improved biological coherence, with cluster boundaries that align better with known cell type markers and developmental trajectories.

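To make the alternation between the two spaces concrete, the following is a minimal toy sketch of a split Gibbs sampler of the kind the abstract describes. It is not the authors' implementation, and it rests on several assumptions made purely for illustration: the decoder is a linear map `W` with orthonormal columns (a PCA-like stand-in), the learned diffusion prior is replaced by an analytic Gaussian-mixture posterior sampler (`gmm_posterior_sample`), and the coupling parameter `rho` with its annealing schedule `rhos` is hypothetical. All function and variable names are invented for this example.

```python
import numpy as np

def gmm_posterior_sample(y, rho, means, tau, rng):
    """Sample z ~ p(z | y) where y = z + N(0, rho^2 I) and z follows an
    equal-weight isotropic Gaussian mixture (component std tau).
    This analytic sampler is a stand-in for a learned diffusion prior."""
    # Under component k, y is marginally N(mean_k, (tau^2 + rho^2) I).
    d2 = ((y[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # shape (n, K)
    logits = -0.5 * d2 / (tau**2 + rho**2)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Draw one component per cell by inverse-CDF sampling.
    u = rng.random((y.shape[0], 1))
    k = np.minimum((probs.cumsum(axis=1) < u).sum(axis=1), means.shape[0] - 1)
    # Gaussian posterior of z given y within the chosen component.
    w = tau**2 / (tau**2 + rho**2)
    post_mean = w * y + (1.0 - w) * means[k]
    post_std = np.sqrt(w) * rho
    return post_mean + post_std * rng.standard_normal(y.shape)

def latent_pnp_gibbs(x, W, sigma, means, tau, rhos, rng):
    """One chain of a toy split Gibbs sampler.
    x: (n, D) noisy observations, W: (D, d) orthonormal decoder (assumed),
    sigma: observation noise std, rhos: decreasing coupling schedule (assumed)."""
    z = x @ W                                  # start from the plain projection
    for rho in rhos:
        # Observation-space step: sample the auxiliary variable u from
        # p(u | x, z), proportional to N(x; u, sigma^2 I) N(u; W z, rho^2 I).
        # This is where noise is reintroduced in the high-dimensional space.
        a = rho**2 / (rho**2 + sigma**2)       # weight on the observed data
        mean_u = a * x + (1.0 - a) * (z @ W.T)
        std_u = sigma * rho / np.sqrt(sigma**2 + rho**2)
        u = mean_u + std_u * rng.standard_normal(x.shape)
        # Latent-space step: project u and denoise it with the prior.
        z = gmm_posterior_sample(u @ W, rho, means, tau, rng)
    return z

# Synthetic demo: three "cell types" in a 2-D latent space, observed in 50-D.
rng = np.random.default_rng(0)
D, d, K, n = 50, 2, 3, 300
W, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns
means = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 6.0]])
tau, sigma = 0.5, 1.0
labels = rng.integers(0, K, n)
z_true = means[labels] + tau * rng.standard_normal((n, d))
x = z_true @ W.T + sigma * rng.standard_normal((n, D))

rhos = np.geomspace(2.0, 0.1, 30)                  # anneal coupling downward
samples = np.stack([latent_pnp_gibbs(x, W, sigma, means, tau, rhos, rng)
                    for _ in range(20)])
z_hat = samples.mean(axis=0)    # averaged estimate across chains
z_std = samples.std(axis=0)     # per-cell spread as a rough uncertainty proxy

err_proj = np.mean((x @ W - z_true) ** 2)          # plain projection baseline
err_pnp = np.mean((z_hat - z_true) ** 2)
print(f"projection MSE: {err_proj:.3f}, sampler MSE: {err_pnp:.3f}")
```

In this sketch, the ratio of `rho` to `sigma` plays the role of the tunable balance between prior and observed data: a large `rho` lets the observation dominate the auxiliary variable, while a small `rho` keeps it close to the decoded latent state. Running several independent chains and taking their mean and spread is one simple way to obtain an averaged estimate together with a per-cell uncertainty proxy; the procedure in the paper may differ in all of these details.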