Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, so traditional techniques produce misaligned embeddings. In this work, we propose \textit{Entropic Optimal Transport (EOT) eigenmaps}, a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure and align the datasets accordingly in a common embedding space. We interpret our approach as an inter-data variant of the classical Laplacian eigenmaps and diffusion maps embeddings, showing that it enjoys many analogous favorable properties. We then analyze a data-generative model where two observed high-dimensional datasets share latent variables on a common low-dimensional manifold, but each dataset is subject to data-specific translation, scaling, nuisance structures, and noise. We show that in a high-dimensional asymptotic regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables. Subsequently, we provide a geometric interpretation of our embedding by relating it to the eigenfunctions of population-level operators encoding the density and geometry of the shared manifold. Finally, we showcase the performance of our approach for data integration and embedding through simulations and analyses of real-world biological data, demonstrating its advantages over alternative methods in challenging scenarios.
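To make the construction described in the abstract concrete, the following is a minimal sketch of embedding two datasets with the leading singular vectors of an EOT plan. It is not the authors' implementation. The function names `sinkhorn_plan` and `eot_eigenmaps`, the squared Euclidean cost, the uniform marginals, the choice of regularization `eps` as a multiple of the median cost, and the weighting of singular vectors by their singular values are all assumptions made for illustration; the abstract does not specify them. The sketch also assumes both datasets live in the same feature space.

```python
import numpy as np


def sinkhorn_plan(X, Y, eps_scale=1.0, n_iter=500):
    """Entropic OT plan between the rows of X (n x d) and Y (m x d).

    Assumptions (not from the source): squared Euclidean cost, uniform
    marginals, and eps = eps_scale * median(cost).
    """
    n, m = X.shape[0], Y.shape[0]
    # Pairwise squared Euclidean distances, shape (n, m).
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    eps = eps_scale * np.median(C)
    # Gibbs kernel; for very small eps this can underflow, in which case
    # a log-domain Sinkhorn iteration would be needed instead.
    K = np.exp(-C / eps)
    a = np.full(n, 1.0 / n)  # target row sums
    b = np.full(m, 1.0 / m)  # target column sums
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    # Plan P = diag(u) K diag(v).
    return u[:, None] * K * v[None, :]


def eot_eigenmaps(X, Y, k=2, eps_scale=1.0, n_iter=500):
    """Joint k-dimensional embedding from the SVD of the EOT plan.

    Returns (emb_x, emb_y, s), where s holds all singular values of the
    rescaled plan. Requires k + 1 <= min(n, m).
    """
    n, m = X.shape[0], Y.shape[0]
    P = sinkhorn_plan(X, Y, eps_scale=eps_scale, n_iter=n_iter)
    # With uniform marginals, sqrt(n*m) * P has leading singular value 1
    # with constant singular vectors, which carry no structure.
    W = np.sqrt(n * m) * P
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Skip the trivial first pair; weight by singular values (an assumed
    # choice, in the spirit of diffusion maps).
    emb_x = U[:, 1:k + 1] * s[1:k + 1]
    emb_y = Vt[1:k + 1].T * s[1:k + 1]
    return emb_x, emb_y, s
```

Calling `eot_eigenmaps(X, Y, k=2)` on two arrays with the same number of columns returns one embedding per dataset in a common `k`-dimensional space. Because the left and right singular vectors come from the same decomposition, points from the two datasets that the plan couples strongly tend to land near each other.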