Clustering is a data analysis method that extracts knowledge by discovering groups of data called clusters. Among these methods, state-of-the-art density-based clustering methods have proven effective for arbitrary-shaped clusters. Despite their encouraging results, they struggle to find low-density clusters, to separate nearby clusters with similar densities, and to handle high-dimensional data. We propose a new characterization of clusters and a new clustering algorithm based on spatial density and a probabilistic approach. First, sub-clusters are built using spatial density, represented as the probability density function ($p.d.f$) of pairwise distances between points. A method is then proposed to agglomerate similar sub-clusters using both their density ($p.d.f$) and their spatial distance. The key idea is to use the Wasserstein metric, a powerful tool for measuring the distance between the $p.d.f$ of two sub-clusters. We show that our approach outperforms other state-of-the-art density-based clustering methods on a wide variety of datasets.
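The comparison of sub-clusters by the distribution of their pairwise distances can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the $p.d.f$ is estimated, so the sketch uses the empirical distribution of pairwise Euclidean distances and the one-dimensional Wasserstein-1 distance, computed as the area between the two empirical cumulative distribution functions. The function names `pairwise_distances` and `wasserstein_1d` and the toy point sets are hypothetical, and the sketch leaves out both sub-cluster construction and the spatial-distance criterion.

```python
import math
from bisect import bisect_right
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Computed as the integral of |F_u(x) - F_v(x)| over x, where F_u and
    F_v are the empirical cumulative distribution functions.
    """
    u, v = sorted(u), sorted(v)
    grid = sorted(u + v)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Toy sub-clusters (assumed data): a unit square, the same square shifted,
# and a square with three times the spacing (a sparser sub-cluster).
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 10.0, y + 10.0) for x, y in square]
sparse = [(3.0 * x, 3.0 * y) for x, y in square]

d_square = pairwise_distances(square)
d_shifted = pairwise_distances(shifted)
d_sparse = pairwise_distances(sparse)

same_density = wasserstein_1d(d_square, d_shifted)
different_density = wasserstein_1d(d_square, d_sparse)

print("square vs shifted:", same_density)
print("square vs sparse:", different_density)
```

Because the shifted square has exactly the same pairwise distances as the original, its distance to the original is zero, whereas the sparser square yields a positive distance. This also shows why a density comparison alone cannot decide a merge: the two identical squares are far apart in space, which is the case the spatial-distance criterion is meant to handle.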