Cluster analysis plays a crucial role in database mining, and DBSCAN is one of the most widely used algorithms in this field. However, DBSCAN has several limitations: it handles high-dimensional large-scale data poorly, it is sensitive to its input parameters, and its clustering results lack robustness. This paper introduces an improved version of DBSCAN that uses the block-diagonal property of the similarity graph to guide the clustering procedure. The key idea is to construct a graph that measures the similarity between high-dimensional large-scale data points and that can be transformed into block-diagonal form by an unknown permutation, and then to apply a cluster-ordering procedure that generates the desired permutation. The clustering structure can then be determined by identifying the diagonal blocks in the permuted graph. We propose a gradient-descent-based method to solve the resulting optimization problem. Additionally, we develop a DBSCAN-based point traversal algorithm that identifies high-density clusters in the graph and generates an augmented ordering of clusters. Permuting the graph according to the traversal order yields its block-diagonal structure, which provides a flexible foundation for both automatic and interactive cluster analysis. We introduce a split-and-refine algorithm that automatically searches for all diagonal blocks in the permuted graph, with theoretical optimality guarantees in specific cases. We extensively evaluate the proposed approach on twelve challenging real-world benchmark clustering datasets and demonstrate superior performance compared to the state-of-the-art clustering method on every dataset.
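The permute-then-read-off-blocks idea can be illustrated with a minimal sketch. This is not the paper's method: the gradient-descent graph construction is replaced by a plain eps-neighborhood graph, the DBSCAN-based traversal by a breadth-first traversal, and the split-and-refine search by a simple scan for index positions with no edges crossing them. All function names, the toy data, and the `eps` value are assumptions made for illustration only.

```python
import numpy as np
from collections import deque


def similarity_graph(X, eps):
    """Binary eps-neighborhood graph (stand-in for the learned graph)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = (d <= eps).astype(float)
    np.fill_diagonal(W, 0.0)
    return W


def traversal_order(W):
    """Breadth-first traversal (stand-in for the DBSCAN-based traversal).

    Points reachable from the same seed are visited consecutively, so each
    connected group occupies a contiguous range of the returned ordering.
    """
    n = W.shape[0]
    seen = np.zeros(n, dtype=bool)
    order = []
    for seed in range(n):
        if seen[seed]:
            continue
        seen[seed] = True
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            order.append(i)
            for j in np.flatnonzero(W[i]):
                if not seen[j]:
                    seen[j] = True
                    queue.append(j)
    return np.array(order)


def diagonal_blocks(P):
    """Return (start, stop) ranges of diagonal blocks in a permuted graph.

    A block is closed at index k when no edge links rows start..k-1 to
    columns k..n-1 (hypothetical stand-in for split-and-refine).
    """
    n = P.shape[0]
    blocks, start = [], 0
    for k in range(1, n + 1):
        if k == n or not P[start:k, k:].any():
            blocks.append((start, k))
            start = k
    return blocks


def demo(seed=0):
    """Two well-separated toy groups of 5 and 7 points, in shuffled order."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-0.2, 0.2, size=(5, 2))        # group near (0, 0)
    B = rng.uniform(-0.2, 0.2, size=(7, 2)) + 5.0  # group near (5, 5)
    X = np.vstack([A, B])[rng.permutation(12)]     # hide the group structure
    W = similarity_graph(X, eps=1.0)
    order = traversal_order(W)
    P = W[np.ix_(order, order)]                    # permuted graph
    return P, diagonal_blocks(P)


if __name__ == "__main__":
    P, blocks = demo()
    print(blocks)
```

In this toy setting each recovered block corresponds to one group of points, and the entries of the permuted matrix outside the diagonal blocks are all zero. Real data rarely yields exactly zero off-diagonal entries, which is presumably why the paper pairs the ordering with a dedicated block search rather than a scan of this kind.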