Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets

Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, so traditional techniques yield misaligned embeddings. In this work, we propose \textit{Entropic Optimal Transport (EOT) eigenmaps}, a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure and align the datasets accordingly in a common embedding space. We interpret our approach as an inter-data variant of the classical Laplacian eigenmaps and diffusion maps embeddings, showing that it enjoys many analogous favorable properties. We then analyze a data-generative model where two observed high-dimensional datasets share latent variables on a common low-dimensional manifold, but each dataset is subject to data-specific translation, scaling, nuisance structures, and noise. We show that in a high-dimensional asymptotic regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables. Subsequently, we provide a geometric interpretation of our embedding by relating it to the eigenfunctions of population-level operators encoding the density and geometry of the shared manifold. Finally, we showcase the performance of our approach for data integration and embedding through simulations and analyses of real-world biological data, demonstrating its advantages over alternative methods in challenging scenarios.
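The core construction described above (compute the EOT plan between two datasets, then embed both using its leading singular vectors) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the squared Euclidean cost, the uniform marginals, the log-domain Sinkhorn solver, the regularization value `eps`, the decision to skip the trivial constant singular pair, and the function names `sinkhorn_plan` and `eot_eigenmaps` are all assumptions made here for illustration.

```python
import numpy as np


def _logsumexp(A, axis):
    # Numerically stable log(sum(exp(A))) along one axis.
    M = A.max(axis=axis, keepdims=True)
    return (M + np.log(np.exp(A - M).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_plan(X, Y, eps=1.0, n_iter=500):
    """Entropic OT plan between uniform measures on the rows of X and Y.

    Assumptions (not from the source): squared Euclidean cost, uniform
    marginals, a fixed number of log-domain Sinkhorn iterations.
    """
    m, n = X.shape[0], Y.shape[0]
    # Pairwise squared Euclidean cost C[i, j] = ||x_i - y_j||^2.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    log_a = np.full(m, -np.log(m))
    log_b = np.full(n, -np.log(n))
    f = np.zeros(m)  # dual potential for X
    g = np.zeros(n)  # dual potential for Y
    for _ in range(n_iter):
        # Alternately enforce the row and column marginal constraints.
        f = eps * (log_a - _logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)


def eot_eigenmaps(X, Y, k=2, eps=1.0, n_iter=500):
    """Joint embedding from the leading singular vectors of the EOT plan.

    With uniform marginals the top singular pair is constant and carries
    no structure, so it is skipped here (an assumption, by analogy with
    diffusion maps).
    """
    P = sinkhorn_plan(X, Y, eps=eps, n_iter=n_iter)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return U[:, 1:k + 1], Vt[1:k + 1].T, s[1:k + 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two datasets sampled from the same latent circle.
    tx = rng.uniform(0, 1, size=40)
    ty = rng.uniform(0, 1, size=50)
    X = np.c_[np.cos(2 * np.pi * tx), np.sin(2 * np.pi * tx)]
    Y = np.c_[np.cos(2 * np.pi * ty), np.sin(2 * np.pi * ty)]
    emb_x, emb_y, sv = eot_eigenmaps(X, Y, k=2)
    print(emb_x.shape, emb_y.shape, sv)
```

Rows of `emb_x` and `emb_y` live in the same `k`-dimensional space, so points from the two datasets can be compared or plotted together. In practice `eps` would need tuning to the scale of the data, and any scaling of the singular vectors by singular values or marginals is left out of this sketch.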

