In this paper, we address the classification of instances that are each represented by a distribution on a vector space rather than by a single point. We consider classification algorithms based on pairwise distances, specifically the Wasserstein metric between distributions. Central to our investigation is dimension reduction within the Wasserstein metric space to improve classification accuracy. We introduce a novel approach grounded in the principle of maximizing Fisher's ratio, defined as the ratio of between-class variation to within-class variation. The directions in which this ratio is maximized are termed discriminant coordinates or canonical variate axes. In practice, the between-class and within-class variations are defined as the average squared Wasserstein distances between pairs of distributions drawn from different classes and from the same class, respectively. The ratio is optimized by an iterative algorithm that alternates between an optimal transport step and a maximization step in the vector space. Empirical studies assess the algorithm's convergence, and experimental results demonstrate that the dimension reduction technique substantially improves classification performance. Moreover, the new method outperforms well-established algorithms that operate on vector representations derived from distributional data. It is also robust to variations in how instances are summarized by distributions, such as the number of components in a Gaussian mixture model (GMM) representation.
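To make the definition of Fisher's ratio concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: each instance is summarized by a single Gaussian (not a GMM), the projection is onto one direction, and the squared 2-Wasserstein distance between the projected one-dimensional Gaussians is computed in closed form as the squared difference of means plus the squared difference of standard deviations. The function names `projected_w2_squared` and `fisher_ratio` and the toy data are hypothetical, and the sketch only evaluates the ratio for a given direction; it does not implement the iterative optimization described above.

```python
from itertools import combinations

import numpy as np


def projected_w2_squared(mean_a, cov_a, mean_b, cov_b, direction):
    """Squared 2-Wasserstein distance between two Gaussians after
    projecting both onto a single (normalized) direction."""
    w = np.asarray(direction, dtype=float)
    w = w / np.linalg.norm(w)
    # A Gaussian N(m, S) projected onto w is the 1-D Gaussian N(w.m, w.S.w).
    mu_a, mu_b = w @ mean_a, w @ mean_b
    sd_a, sd_b = np.sqrt(w @ cov_a @ w), np.sqrt(w @ cov_b @ w)
    # Closed form for 1-D Gaussians: (mean gap)^2 + (std gap)^2.
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2


def fisher_ratio(means, covs, labels, direction):
    """Average squared distance over different-class pairs divided by
    the average squared distance over same-class pairs."""
    between, within = [], []
    for i, j in combinations(range(len(labels)), 2):
        d2 = projected_w2_squared(means[i], covs[i], means[j], covs[j], direction)
        (within if labels[i] == labels[j] else between).append(d2)
    return np.mean(between) / np.mean(within)


# Toy data (hypothetical): two classes separated mainly along the first axis.
means = [np.array(m) for m in ([0.0, 0.0], [0.5, 2.0], [3.0, 0.0], [3.5, 2.0])]
covs = [np.eye(2) for _ in means]
labels = [0, 0, 1, 1]

ratio_first_axis = fisher_ratio(means, covs, labels, [1.0, 0.0])
ratio_second_axis = fisher_ratio(means, covs, labels, [0.0, 1.0])
print(ratio_first_axis, ratio_second_axis)
```

On this toy data the first axis yields a much larger ratio than the second, which is the kind of direction a discriminant coordinate is meant to capture. With GMM representations or multi-dimensional projections, the closed form used here no longer applies and an optimal transport solver would be needed for the pairwise distances.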