Data from individual observations can originate from various sources or modalities but are often intrinsically linked. Multimodal data integration can enrich information content compared to single-source data. Manifold alignment is a form of data integration that seeks a shared, underlying low-dimensional representation of multiple data sources, one that emphasizes similarities between alternative representations of the same entities. Semi-supervised manifold alignment relies on partially known correspondences between domains, obtained either through shared features or through other known associations. In this paper, we introduce two semi-supervised manifold alignment methods. The first method, Shortest Paths on the Union of Domains (SPUD), forms a unified graph structure by using known correspondences to establish graph edges. By learning inter-domain geodesic distances, SPUD creates a global, multi-domain structure. The second method, Manifold Alignment via Stochastic Hopping (MASH), learns local geometry within each domain and forms a joint diffusion operator from the known correspondences, then iteratively learns new inter-domain correspondences through a random-walk approach. Through the diffusion process, MASH forms a coupling matrix that links heterogeneous domains into a unified structure. We compare SPUD and MASH with existing semi-supervised manifold alignment methods and show that they outperform competing methods both in aligning true correspondences and in cross-domain classification. In addition, we show how these methods can be applied to transfer label information between domains.
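The graph construction described for SPUD can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`knn_edges`, `build_union_graph`, `dijkstra`, `cross_domain_distances`), the k-nearest-neighbor graphs within each domain, the zero-weight anchor edges, and the assumption that both domains already have comparable distance scales are all illustrative choices that the abstract does not specify.

```python
import heapq
import math


def knn_edges(points, k):
    """Directed k-nearest-neighbor edges (i, j, distance) inside one domain."""
    edges = []
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        edges.extend((i, j, d) for d, j in nearest[:k])
    return edges


def build_union_graph(domain_a, domain_b, anchors, k=2, anchor_weight=0.0):
    """Union of both domain graphs; known correspondences become bridging edges.

    Nodes 0..len(domain_a)-1 belong to domain A, the remaining nodes to domain B.
    anchor_weight=0.0 is an assumption: anchored points are treated as identical.
    """
    n_a = len(domain_a)
    adj = {node: [] for node in range(n_a + len(domain_b))}

    def connect(u, v, w):
        # Store every edge in both directions so the graph is undirected.
        adj[u].append((v, w))
        adj[v].append((u, w))

    for i, j, d in knn_edges(domain_a, k):
        connect(i, j, d)
    for i, j, d in knn_edges(domain_b, k):
        connect(n_a + i, n_a + j, d)
    for a, b in anchors:
        connect(a, n_a + b, anchor_weight)
    return adj


def dijkstra(adj, source):
    """Shortest-path (graph geodesic) distances from source to every node."""
    dist = {node: math.inf for node in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist


def cross_domain_distances(domain_a, domain_b, anchors, k=2):
    """Matrix D where D[i][j] is the geodesic distance from A[i] to B[j]."""
    adj = build_union_graph(domain_a, domain_b, anchors, k)
    n_a = len(domain_a)
    rows = []
    for i in range(n_a):
        dist = dijkstra(adj, i)
        rows.append([dist[n_a + j] for j in range(len(domain_b))])
    return rows


# Toy data: the same four entities measured in a 1-D and a 2-D feature space.
domain_a = [(0.0,), (1.0,), (2.0,), (3.0,)]
domain_b = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
anchors = [(0, 0), (3, 3)]  # only the first and last correspondences are known
D = cross_domain_distances(domain_a, domain_b, anchors)
```

In this toy setup, every path between the two domains has to pass through an anchor edge, so the cross-domain distances are determined entirely by how far each point lies from the known correspondences within its own domain. A practical pipeline would likely need to normalize distances per domain before merging the graphs, which this sketch omits.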