Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. Traditionally, this involves either projecting the data onto interpretable spaces with dimensionality reduction methods or organizing points into meaningful clusters. In practice, the two are applied sequentially, with no guarantee that the resulting clustering is consistent with the dimensionality reduction that preceded it. In this work, we offer a fresh perspective based on distributions. Leveraging tools from optimal transport, particularly the Gromov-Wasserstein distance, we unify clustering and dimensionality reduction into a single framework called distributional reduction, which addresses both tasks jointly through a single optimization problem. Through comprehensive experiments, we highlight the versatility and interpretability of our method and show that it outperforms existing approaches across a variety of image and genomics datasets.
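For readers unfamiliar with the Gromov-Wasserstein distance mentioned above, the following is a sketch of its standard discrete form with a squared loss. The abstract does not state the objective, so the notation is an assumption made for illustration: $C_X \in \mathbb{R}^{N \times N}$ and $C_Z \in \mathbb{R}^{n \times n}$ are pairwise similarity (or distance) matrices of the input points and of the embedded points, and $h_X$, $h_Z$ are probability weight vectors over them. The second display is one plausible reading of "a single optimization problem", not a formulation confirmed by the text.

```latex
% Standard squared-loss Gromov-Wasserstein discrepancy between two
% weighted similarity matrices (all symbols are illustrative assumptions).
\[
  \mathrm{GW}(C_X, C_Z, h_X, h_Z)
  = \min_{T \in \mathcal{U}(h_X, h_Z)}
    \sum_{i,j=1}^{N} \sum_{k,l=1}^{n}
    \bigl( [C_X]_{ij} - [C_Z]_{kl} \bigr)^{2} \, T_{ik} \, T_{jl},
\]
\[
  \mathcal{U}(h_X, h_Z)
  = \bigl\{ T \in \mathbb{R}_{+}^{N \times n} :
      T \mathbf{1}_{n} = h_X, \;
      T^{\top} \mathbf{1}_{N} = h_Z \bigr\}.
\]

% Hypothetical joint problem: optimize the embedding Z and its weights h_Z.
\[
  \min_{Z \in \mathbb{R}^{n \times d}, \; h_Z \in \Delta_{n}}
  \mathrm{GW}\bigl( C_X, C_Z(Z), h_X, h_Z \bigr)
\]
```

Under this reading, choosing $n < N$ means the coupling $T$ groups input points onto $n$ prototypes (the clustering), while the prototype positions $Z$ in dimension $d$ provide the low-dimensional representation, so both tasks would be handled by the same problem.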