Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, so that traditional techniques produce misaligned embeddings. In this work, we propose Entropic Optimal Transport (EOT) eigenmaps, a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure and align them in a common embedding space. We interpret our approach as an inter-data variant of the classical Laplacian eigenmaps and diffusion maps embeddings, showing that it enjoys many analogous favorable properties. We analyze a generative model in which two observed high-dimensional datasets share latent variables supported on a common low-dimensional manifold, while each dataset is subject to translation, geometric distortion, orthogonal nuisance structure, and noise. In a large-sample, high-dimensional regime, we prove that the EOT plan concentrates around a population kernel on an effective manifold determined by the geometric mean of the distortions, with invariance to translations, orthogonal nuisance structure, and noise. Subsequently, we relate our embedding to eigenfunctions of population-level operators encoding the density and geometry of the shared manifold. Finally, we showcase the performance of our approach for data integration and embedding through simulations and analyses of real-world biological data, demonstrating its advantages over alternative methods in challenging scenarios.
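To make the pipeline described in the abstract concrete, the following is a minimal sketch of embedding two datasets through the singular vectors of an EOT plan. It is not the authors' implementation. The function names (`eot_plan`, `eot_eigenmaps`), uniform sample weights, the squared-Euclidean cost, the regularization value `eps`, the `sqrt(n * m)` rescaling, and the choice to discard the trivial leading singular pair (mirroring the constant eigenvector dropped in Laplacian eigenmaps) are all assumptions made for illustration. The solver is a plain log-domain Sinkhorn iteration, a standard method swapped in here because the abstract does not specify one.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def eot_plan(X, Y, eps=1.0, n_iter=500):
    """Entropic OT plan between point clouds X (n x p) and Y (m x p).

    Assumptions: uniform weights and squared-Euclidean cost.
    """
    n, m = X.shape[0], Y.shape[0]
    C = (np.sum(X**2, axis=1)[:, None]
         + np.sum(Y**2, axis=1)[None, :]
         - 2.0 * X @ Y.T)
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f = np.zeros(n)  # dual potential for X
    g = np.zeros(m)  # dual potential for Y
    for _ in range(n_iter):
        # Alternately enforce the row and column marginal constraints.
        f = eps * (log_a - _logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)


def eot_eigenmaps(X, Y, dim=2, eps=1.0, n_iter=500):
    """Joint embedding from singular vectors of the rescaled EOT plan."""
    P = eot_plan(X, Y, eps=eps, n_iter=n_iter)
    n, m = P.shape
    # With uniform marginals, sqrt(n * m) * P has top singular value 1
    # with constant singular vectors, which carry no structure.
    U, s, Vt = np.linalg.svd(np.sqrt(n * m) * P, full_matrices=False)
    emb_X = np.sqrt(n) * U[:, 1:dim + 1]     # left vectors embed X
    emb_Y = np.sqrt(m) * Vt[1:dim + 1].T     # right vectors embed Y
    return emb_X, emb_Y, s


# Toy example (hypothetical data): two noisy samples of the same circle,
# with the second dataset translated.
rng = np.random.default_rng(0)


def circle(t):
    return np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)


shift = np.array([3.0, -2.0, 1.0])
X = circle(rng.uniform(0, 2 * np.pi, 60)) + 0.05 * rng.normal(size=(60, 3))
Y = circle(rng.uniform(0, 2 * np.pi, 80)) + 0.05 * rng.normal(size=(80, 3)) + shift

emb_X, emb_Y, s = eot_eigenmaps(X, Y, dim=2, eps=1.0)
print(emb_X.shape, emb_Y.shape)
```

Because a translation of either dataset changes the squared-Euclidean cost only by terms that depend on a single index, those terms are absorbed by the Sinkhorn potentials and the converged plan is unchanged. This is a small-scale illustration of the translation invariance stated in the abstract. The left and right singular vectors come from the same decomposition, so the two embeddings share a coordinate system without a separate alignment step.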