DBSCAN is a popular density-based clustering algorithm with many practical applications. However, its running time in high-dimensional space or in a general metric space ({\em e.g.,} clustering a set of texts under edit distance) can be as large as quadratic in the input size. Moreover, most existing acceleration techniques for DBSCAN apply only to low-dimensional Euclidean space. In this paper, we study the DBSCAN problem under the assumption that the inliers (the core points and border points) have a low intrinsic dimension, which is realistic for many high-dimensional applications, while the outliers can be located anywhere in the space without any assumption. First, we propose a $k$-center clustering based algorithm that reduces the time-consuming labeling and merging tasks of DBSCAN to linear time. Further, we propose a linear-time approximate DBSCAN algorithm, whose key idea is to build a novel small-size summary of the core points. Our algorithm can also be implemented efficiently for streaming data, with required memory independent of the input size. Finally, we conduct experiments comparing our algorithms with several popular DBSCAN algorithms. The experimental results suggest that our proposed approach can significantly reduce the computational complexity in practice.
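To make the $k$-center ingredient concrete, the following is a minimal sketch of a radius-guided greedy farthest-point selection (the classical Gonzalez heuristic) over an arbitrary metric. It is an illustration only and is not claimed to be the algorithm proposed in the paper; the function name `radius_guided_centers`, its parameters, and the stopping rule are assumptions introduced here. The sketch needs only a distance function, so it applies equally to edit distance on texts, and its cost is proportional to the number of points times the number of centers selected.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def radius_guided_centers(
    points: Sequence[T],
    dist: Callable[[T, T], float],
    radius: float,
) -> Tuple[List[int], List[int]]:
    """Greedy farthest-point selection (hypothetical helper, not from the paper).

    Keeps adding the point farthest from the current centers until every
    point lies within `radius` of some center. Returns the indices of the
    chosen centers and, for each point, the index of its nearest center.
    """
    if not points:
        return [], []

    centers = [0]  # assumption: start from the first point
    nearest = [dist(p, points[0]) for p in points]  # distance to nearest center
    assign = [0] * len(points)  # index of the nearest center for each point

    while True:
        # The point currently farthest from all chosen centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        if nearest[far] <= radius:
            break  # every point is covered by a ball of the given radius
        centers.append(far)
        # One pass over the data per new center.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = far

    return centers, assign


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
    chosen, owner = radius_guided_centers(data, lambda a, b: abs(a - b), 0.5)
    print(chosen, owner)
```

When the data have low intrinsic dimension, the number of centers needed to cover them at a fixed radius stays small, which is the intuition for why a covering of this kind can keep per-point work bounded.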