Clustering by Denoising: Latent plug-and-play diffusion for single-cell data

Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet clustering accuracy, and with it any downstream analysis based on cell labels, remains challenging due to measurement noise and biological variability. In standard latent spaces (e.g., those obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation space from the denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while noise is reintroduced in the original high-dimensional observation space to steer the process. This "input-space steering" keeps the denoising trajectory faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between the prior and the observed data; (2) uncertainty quantification, providing principled uncertainty estimates for downstream analysis; and (3) generalizable denoising, which leverages clean reference data to denoise noisier datasets and, via averaging, improves quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. On synthetic data, our method improves clustering accuracy across varied noise levels and dataset shifts. On real-world single-cell data, it yields more biologically coherent cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.
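To make the alternation described above concrete, the following is a minimal toy sketch of a split Gibbs sampler, not the authors' implementation. It rests on several simplifying assumptions that the abstract does not state: the latent map is a linear projection `W` with orthonormal columns (a PCA-like stand-in), and the learned diffusion prior is replaced by a Gaussian mixture with known centers, for which the latent denoising step can be sampled in closed form. All names (`split_gibbs_denoise`, `rho`, `tau`, `sigma`) are hypothetical. The sampler alternates between drawing a noisy auxiliary variable in the high-dimensional observation space and drawing a denoised point in the low-dimensional latent space; averaging the draws gives a denoised estimate and their spread gives a per-cell uncertainty.

```python
import numpy as np

def split_gibbs_denoise(Y, W, centers, weights, tau, sigma, rho,
                        n_iter=300, burn_in=100, seed=0):
    """Toy split Gibbs sampler (illustrative assumptions only).

    Y: (n, D) noisy observations; W: (D, d) with orthonormal columns;
    centers: (K, d) mixture means standing in for a learned prior;
    tau: within-cluster std; sigma: assumed observation noise std;
    rho: coupling std between observation and latent space.
    """
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    K, d = centers.shape
    X = Y @ W                                  # start from the naive projection
    var_z = 1.0 / (1.0 / sigma**2 + 1.0 / rho**2)
    var_x = 1.0 / (1.0 / tau**2 + 1.0 / rho**2)
    samples = []
    for t in range(n_iter):
        # Observation-space step: draw z | x, y, reintroducing noise in D dims.
        mean_z = var_z * (Y / sigma**2 + (X @ W.T) / rho**2)
        Z = mean_z + np.sqrt(var_z) * rng.standard_normal((n, D))
        # Latent-space step: draw x | z under the mixture prior.
        U = Z @ W                              # u = x + rho * noise in latent space
        sq = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        logp = np.log(weights)[None, :] - 0.5 * sq / (tau**2 + rho**2)
        logp -= logp.max(axis=1, keepdims=True)
        P = np.exp(logp)
        P /= P.sum(axis=1, keepdims=True)
        # Sample one mixture component per cell by inverting the CDF.
        k = (rng.random((n, 1)) > P.cumsum(axis=1)).sum(axis=1)
        k = np.minimum(k, K - 1)
        mean_x = var_x * (centers[k] / tau**2 + U / rho**2)
        X = mean_x + np.sqrt(var_x) * rng.standard_normal((n, d))
        if t >= burn_in:
            samples.append(X)
    S = np.stack(samples)
    return S.mean(axis=0), S.std(axis=0)       # denoised estimate, uncertainty

# Synthetic demo: two well-separated clusters embedded in 50 dimensions.
rng = np.random.default_rng(1)
D, d, n = 50, 2, 200
W, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
tau, sigma = 0.3, 1.5
labels = rng.integers(0, 2, size=n)
X_true = centers[labels] + tau * rng.standard_normal((n, d))
Y = X_true @ W.T + sigma * rng.standard_normal((n, D))

X_mean, X_std = split_gibbs_denoise(Y, W, centers, np.array([0.5, 0.5]),
                                    tau, sigma, rho=1.0)
mse_naive = np.mean((Y @ W - X_true) ** 2)
mse_denoised = np.mean((X_mean - X_true) ** 2)
print(f"naive projection MSE: {mse_naive:.3f}, denoised MSE: {mse_denoised:.3f}")
```

In this sketch, `sigma` and `rho` play the role of the tunable balance: a larger assumed `sigma` lets the prior dominate, while a smaller one keeps samples closer to the observed data. A real implementation would replace the closed-form mixture step with a learned diffusion denoiser and the linear map with whatever encoder the method actually uses.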
