Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. Traditionally, this involves either using dimensionality reduction (DR) methods to project data onto lower-dimensional spaces or organizing points into meaningful clusters (clustering). In this work, we revisit these approaches through the lens of optimal transport and exhibit their relationships with the Gromov-Wasserstein problem. This reveals a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows them to be addressed jointly within a single optimization problem. We empirically demonstrate its relevance to identifying low-dimensional prototypes that represent data at different scales, across multiple image and genomic datasets.