Integrative analysis of multiple heterogeneous datasets has become standard practice in many research fields, especially in single-cell genomics and medical informatics. Existing approaches often have limited power to capture nonlinear structures, account insufficiently for noise and the effects of high dimensionality, adapt poorly to imbalances in signal strength and sample size, and produce results that are sometimes difficult to interpret. To address these limitations, we propose a novel kernel spectral method that achieves joint embeddings of two independently observed high-dimensional noisy datasets. The proposed method automatically captures and leverages possibly shared low-dimensional structures across datasets to enhance embedding quality. The resulting low-dimensional embeddings can be used for many downstream tasks such as simultaneous clustering, data visualization, and denoising. The proposed method is justified by rigorous theoretical analysis. Specifically, we show the consistency of our method in recovering the low-dimensional noiseless signals, and characterize the effects of the signal-to-noise ratios on the rates of convergence. Under a joint manifolds model framework, we establish the convergence of the final embeddings to the eigenfunctions of newly introduced integral operators. These operators, referred to as duo-landmark integral operators, are defined by the convolutional kernel maps of some reproducing kernel Hilbert spaces (RKHSs). These RKHSs capture the partially or entirely shared underlying low-dimensional nonlinear signal structures of the two datasets. Our numerical experiments and analyses of two single-cell omics datasets demonstrate the empirical advantages of the proposed method over existing methods in both embeddings and several downstream tasks.
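The abstract does not spell out the algorithm, so the following is only a rough sketch of what a cross-dataset kernel spectral embedding might look like; it is not the authors' procedure. The function name `cross_kernel_embedding`, the Gaussian kernel, the median bandwidth heuristic, and the requirement that both datasets share the same feature space are all assumptions made for illustration. The idea shown is that a rectangular kernel matrix between the samples of the two datasets is factored by a singular value decomposition, and the leading left and right singular vectors serve as coordinates for the first and second dataset respectively.

```python
import numpy as np


def cross_kernel_embedding(X, Y, n_components=2, bandwidth=None):
    """Illustrative joint embedding of X (n1 x p) and Y (n2 x p).

    Hypothetical helper: builds a Gaussian kernel between rows of X and
    rows of Y, then embeds both datasets with its leading singular vectors.
    """
    # Squared Euclidean distance between every row of X and every row of Y.
    sq_dist = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off

    # Assumption: median heuristic for the bandwidth when none is supplied.
    if bandwidth is None:
        bandwidth = np.median(sq_dist)

    # Rectangular (n1 x n2) cross-dataset kernel matrix.
    K = np.exp(-sq_dist / bandwidth)

    # Thin SVD: rows of U index samples of X, rows of Vt.T index samples of Y.
    U, s, Vt = np.linalg.svd(K, full_matrices=False)

    # Scale the leading singular vectors by their singular values.
    emb_x = U[:, :n_components] * s[:n_components]
    emb_y = Vt[:n_components].T * s[:n_components]
    return emb_x, emb_y, s
```

In this sketch each dataset acts as a reference set for the other, so structure present in both datasets tends to dominate the leading singular vectors, while the two embeddings share a common coordinate system through the paired singular vectors. How the actual method chooses the kernel, bandwidth, and number of components is not stated in the abstract.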