Wasserstein distances form a family of metrics on spaces of probability measures that have recently seen many applications. However, statistical analysis in these spaces is complicated by the nonlinearity of Wasserstein spaces. One potential solution to this problem is Linear Optimal Transport (LOT). This method allows one to find a Euclidean embedding, called the LOT embedding, of measures in some Wasserstein spaces, but some information is lost in this embedding. To understand whether statistical analysis relying on LOT embeddings can make valid inferences about the original data, it is therefore helpful to quantify how well these embeddings describe that data. To this end, we present a decomposition of the Fr\'echet variance of a set of measures in the 2-Wasserstein space, which allows one to compute the percentage of variance explained by the LOT embeddings of those measures. We then extend this decomposition to the Fused Gromov-Wasserstein setting. We also present several experiments that explore the relationship between the dimension of the LOT embedding, the percentage of variance it explains, and the classification accuracy of machine learning classifiers built on the embedded data. We use the MNIST handwritten digits dataset, the IMDB-50000 dataset, and Diffusion Tensor MRI images for these experiments. Our results illustrate the effectiveness of low-dimensional LOT embeddings in terms of both the percentage of variance explained and the classification accuracy of models built on the embedded data.