Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variation and integrate heterogeneous single-cell datasets. Despite their wide use, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test, justified by high-dimensional statistical theory, that robustly assesses the alignability between datasets and so helps avoid misleading inference. On a diverse range of real and simulated benchmark datasets, SMAI outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses, such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.