Unsupervised representation learning methods are widely used for gaining insight into high-dimensional, unstructured, or structured data. In some cases, users may have prior topological knowledge about the data, such as a known cluster structure, or knowledge that the data lie along a tree- or graph-structured topology. However, generic methods for ensuring that such structure is salient in the low-dimensional representations are lacking. This harms the interpretability of low-dimensional embeddings and, plausibly, downstream learning tasks. To address this issue, we introduce topological regularization: a generic approach based on algebraic topology for incorporating topological prior knowledge into low-dimensional embeddings. We introduce a class of topological loss functions, and show that jointly optimizing an embedding loss with such a topological loss function as a regularizer yields embeddings that reflect not only local proximities but also the desired topological structure. We include a self-contained overview of the required foundational concepts in algebraic topology, and provide intuitive guidance on how to design topological loss functions for a variety of shapes, such as clusters, cycles, and bifurcations. We empirically evaluate the proposed approach on computational efficiency, robustness, and versatility in combination with linear and non-linear dimensionality reduction and graph embedding methods.
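To make the idea of a regularized objective concrete, the following is a minimal sketch, not the loss functions defined in the paper. It assumes a metric-stress embedding loss and a hypothetical cluster prior built from 0-dimensional persistence, using the standard fact that the finite death times of a point cloud's 0-dimensional Vietoris-Rips persistence diagram equal the edge lengths of its Euclidean minimum spanning tree. The function names, the choice of stress, and the `weight` parameter are all illustrative assumptions.

```python
import math


def mst_edge_lengths(points):
    """Edge lengths of a Euclidean minimum spanning tree (Prim's algorithm).

    These lengths coincide with the finite death times of the
    0-dimensional persistence diagram of the point cloud.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[v] = shortest distance from v to any point already in the tree
    best = [math.dist(points[0], p) for p in points]
    lengths = []
    for _ in range(n - 1):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return lengths


def cluster_topological_loss(points, n_clusters):
    """Hypothetical cluster prior: total length of all but the longest MST edges.

    The n_clusters - 1 longest edges are treated as the gaps between clusters;
    the remaining edges measure within-cluster spread, which the loss penalizes.
    """
    deaths = sorted(mst_edge_lengths(points), reverse=True)
    return sum(deaths[n_clusters - 1:])


def stress(high, low):
    """Illustrative embedding loss: squared mismatch of pairwise distances."""
    n = len(high)
    return sum(
        (math.dist(high[i], high[j]) - math.dist(low[i], low[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def regularized_loss(high, low, n_clusters, weight):
    """Embedding loss plus a weighted topological regularizer on the embedding."""
    return stress(high, low) + weight * cluster_topological_loss(low, n_clusters)


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
    print(cluster_topological_loss(data, n_clusters=2))
    print(regularized_loss(data, data, n_clusters=2, weight=0.5))
```

In practice the combined objective would be minimized over the embedding coordinates with a gradient-based optimizer; this sketch only evaluates it, which is enough to show how the two terms are combined and how the regularization weight trades local proximity against the topological prior.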