Data-driven Riemannian geometry has emerged as a powerful tool for interpretable representation learning, offering improved efficiency in downstream tasks. Moving forward, it is crucial to balance computationally cheap manifold mappings with efficient training algorithms. In this work, we integrate concepts from pullback Riemannian geometry and generative models to propose a framework for data-driven Riemannian geometry that is scalable in both geometry and learning: score-based pullback Riemannian geometry. Focusing on unimodal distributions as a first step, we propose a score-based Riemannian structure with closed-form geodesics that pass through the support of the data probability density. With this structure, we construct a Riemannian autoencoder (RAE) with error bounds for discovering the correct data manifold dimension. This framework can naturally be used with anisotropic normalizing flows by adopting isometry regularization during training. Through numerical experiments on various datasets, we demonstrate that our framework not only produces high-quality geodesics through the data support, but also reliably estimates the intrinsic dimension of the data manifold and provides a global chart of the manifold, even in high-dimensional ambient spaces.
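To make the phrase "closed-form geodesics" concrete, the following is a minimal sketch of the general pullback idea only, not the score-based structure proposed in this work. It assumes the simplest setting: the Euclidean metric pulled back through a diffeomorphism, in which case a geodesic is a straight line in the image space mapped back through the inverse. The map `phi`, its inverse `phi_inv`, and the helper names `pullback_geodesic` and `pullback_distance` are hypothetical stand-ins chosen for illustration; in practice the diffeomorphism would be learned, for example by a normalizing flow.

```python
import numpy as np


def phi(x):
    """Toy diffeomorphism of R^2 (hypothetical stand-in for a learned map).

    It flattens the parabola x2 = x1**2 onto the horizontal axis.
    """
    x = np.asarray(x, dtype=float)
    return np.stack([x[..., 0], x[..., 1] - x[..., 0] ** 2], axis=-1)


def phi_inv(z):
    """Exact inverse of the toy map phi."""
    z = np.asarray(z, dtype=float)
    return np.stack([z[..., 0], z[..., 1] + z[..., 0] ** 2], axis=-1)


def pullback_geodesic(x, y, t):
    """Geodesic from x to y under the pullback of the Euclidean metric.

    Interpolate linearly in the image of phi, then map back with phi_inv.
    """
    t = np.asarray(t, dtype=float)[..., None]
    z = (1.0 - t) * phi(x) + t * phi(y)
    return phi_inv(z)


def pullback_distance(x, y):
    """Geodesic distance: Euclidean distance between the images under phi."""
    return float(np.linalg.norm(phi(x) - phi(y)))


# Two points on the parabola x2 = x1**2.
x = np.array([-1.0, 1.0])
y = np.array([1.0, 1.0])

ts = np.linspace(0.0, 1.0, 5)
path = pullback_geodesic(x, y, ts)   # shape (5, 2)
dist = pullback_distance(x, y)
```

In this toy case the geodesic between two points on the parabola stays on the parabola instead of cutting across it as a Euclidean straight line would, which is the qualitative behavior the abstract describes as geodesics passing through the data support. Both the geodesic and the distance cost only evaluations of the map and its inverse, with no numerical ODE solve.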