Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data. SMAI provides a statistical test to robustly determine the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.