Dimension reduction algorithms are a crucial part of many data science pipelines, including data exploration, feature creation and selection, and denoising. Despite their widespread use, many non-linear dimension reduction algorithms are poorly understood from a theoretical perspective. In this work we consider a generalized version of multidimensional scaling, posed as an optimization problem in which a mapping from a high-dimensional feature space to a lower-dimensional embedding space seeks to preserve either the inner products or the norms of the distribution in feature space; this formulation encompasses many commonly used dimension reduction algorithms. We analytically investigate the variational properties of this problem, leading to the following insights: 1) standard particle descent methods may converge to non-deterministic embeddings; 2) a relaxed or probabilistic formulation of the problem admits solutions with easily interpretable necessary conditions; 3) the globally optimal solutions to the relaxed problem must in fact give deterministic embeddings. This progression of results mirrors the classical development of optimal transportation, and in a case related to the Gromov-Wasserstein distance it gives explicit insight into the structure of the optimal embeddings, which are parametrically determined and discontinuous. Finally, we illustrate that a standard computational implementation of this task does not learn deterministic embeddings, and therefore learns sub-optimal mappings, and that the embeddings it produces have highly misleading clustering structure, underscoring the delicate nature of solving this problem computationally.
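To fix ideas, here is a minimal sketch of the kind of objective described above, in our own notation (the paper's precise formulation may differ): for a data distribution $\mu$ on $\mathbb{R}^D$ and an embedding map $T : \mathbb{R}^D \to \mathbb{R}^d$ with $d < D$, the inner-product-preserving version reads

$$\min_{T}\; \iint \big( \langle x, x' \rangle - \langle T(x), T(x') \rangle \big)^2 \, d\mu(x)\, d\mu(x'),$$

with the norm-preserving variant comparing $\|x - x'\|$ to $\|T(x) - T(x')\|$ instead. The relaxed formulation in item 2 replaces the deterministic map $T$ with a probability measure on $\mathbb{R}^D \times \mathbb{R}^d$ whose first marginal is $\mu$, in the spirit of Gromov-Wasserstein couplings.

Likewise, a minimal NumPy sketch of the kind of "standard particle descent" the abstract critiques, applied to the empirical inner-product version of this objective; the data, constants, and step size here are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n samples in a D-dimensional feature space, embedded in d dimensions.
n, D, d = 200, 10, 2
X = rng.standard_normal((n, D))
G = X @ X.T  # feature-space inner products (Gram matrix)

# Each row of Y is a free "particle" in the embedding space; we run plain
# gradient descent on L(Y) = (1/n^2) * sum_{i,j} (G_ij - <y_i, y_j>)^2.
Y = 0.01 * rng.standard_normal((n, d))
step = 1.0
for _ in range(2000):
    R = Y @ Y.T - G                      # residual inner products
    Y -= step * (4.0 / n**2) * (R @ Y)   # exact gradient of L with respect to Y

print("final loss:", np.mean((Y @ Y.T - G) ** 2))
```

Because the particles are optimized independently rather than through a parametric map, the limiting assignment from features to embeddings need not be a function of $x$, which appears to be the sense in which such methods can produce non-deterministic embeddings.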