Discrete distribution clustering (D2C) is often solved with Wasserstein barycenter methods. These methods share a common assumption that each cluster can be well represented by its barycenter, which may not hold in many real applications. In this work, we propose a simple yet effective framework for D2C based on spectral clustering and distribution affinity measures (e.g., maximum mean discrepancy and Wasserstein distance). To improve scalability, we propose using linear optimal transport to construct affinity matrices efficiently on large datasets. We provide theoretical guarantees for the success of the proposed methods in clustering distributions. Experiments on synthetic and real data show that our methods substantially outperform the baselines in both clustering accuracy and computational efficiency.
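As a rough illustration of the framework described above, the following minimal sketch builds a pairwise affinity matrix between empirical distributions from the squared maximum mean discrepancy (MMD) and then applies normalized spectral clustering. It is not the authors' implementation: the Gaussian kernel, the bandwidth `gamma`, the median heuristic for the affinity scale, the farthest-point k-means initialization, and the toy data are all assumptions made for this example. A Wasserstein distance could be substituted for `mmd2` without changing the rest of the pipeline.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between sample sets X and Y,
    using a Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) (assumed choice)."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * np.maximum(sq, 0.0))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

def spectral_cluster(D2, k, n_iter=50):
    """Normalized spectral clustering from a matrix of squared distances D2."""
    sigma2 = np.median(D2[D2 > 0])           # median heuristic (assumption)
    A = np.exp(-D2 / sigma2)                 # affinity between distributions
    np.fill_diagonal(A, 0.0)
    d = A.sum(1)
    M = A / np.sqrt(np.outer(d, d))          # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    U = vecs[:, -k:]                         # top-k eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Plain k-means on the rows of U with farthest-point initialization.
    centers = [U[0]]
    for _ in range(k - 1):
        dist = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(0)
    return labels

# Toy data: 12 empirical distributions of 40 points each, drawn from two groups.
rng = np.random.default_rng(0)
dists = [rng.normal(0.0, 1.0, size=(40, 2)) for _ in range(6)] + \
        [rng.normal(3.0, 1.0, size=(40, 2)) for _ in range(6)]

n = len(dists)
D2 = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D2[i, j] = D2[j, i] = max(mmd2(dists[i], dists[j]), 0.0)

labels = spectral_cluster(D2, k=2)
print(labels)
```

Computing all pairwise distances costs a number of evaluations that is quadratic in the number of distributions, which is the bottleneck the abstract addresses by using linear optimal transport to construct the affinity matrix; that step is not reproduced in this sketch.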