DBSCAN is a popular density-based clustering algorithm with many practical applications. However, in high-dimensional space or a general metric space ({\em e.g.,} clustering a set of texts under edit distance), its running time can be as large as quadratic in the input size. Moreover, most existing acceleration techniques for DBSCAN apply only to low-dimensional Euclidean space. In this paper, we study the DBSCAN problem under the assumption that the inliers (the core points and border points) have a low intrinsic dimension, which is realistic for many high-dimensional applications, while the outliers may lie anywhere in the space without any assumption. First, we propose a $k$-center clustering based algorithm that reduces the time-consuming labeling and merging tasks of DBSCAN to linear time. Further, we propose a linear-time approximate DBSCAN algorithm, whose key idea is to build a novel small-size summary of the core points. Our algorithm can also be implemented efficiently for streaming data, with a memory requirement that is independent of the input size. Finally, we conduct experiments comparing our algorithms with several popular DBSCAN algorithms. The experimental results suggest that our proposed approach can significantly reduce the computational cost in practice.
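To make the $k$-center ingredient concrete, the following is a minimal sketch of the classic greedy farthest-point heuristic for $k$-center, adapted to stop once every point lies within a given radius of some center. It is an illustration under stated assumptions, not the algorithm proposed in this paper: the function name `radius_guided_centers`, its parameters, and the stopping rule are hypothetical choices made here. It only requires a distance function, so it applies to a general metric space as well as to Euclidean data.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def radius_guided_centers(
    points: Sequence[T],
    dist: Callable[[T, T], float],
    radius: float,
) -> Tuple[List[int], List[int]]:
    """Greedy farthest-point selection (hypothetical helper, for illustration).

    Returns (centers, nearest): `centers` holds indices of the chosen points,
    and `nearest[i]` is the index of the chosen center assigned to point i.
    Assumes radius >= 0 and dist(p, p) == 0, which guarantees termination.
    """
    if not points:
        return [], []

    centers = [0]                                # arbitrary first center
    nearest = [0] * len(points)                  # assigned center per point
    gap = [dist(points[0], p) for p in points]   # distance to assigned center

    while True:
        # The point farthest from all current centers.
        far = max(range(len(points)), key=gap.__getitem__)
        if gap[far] <= radius:
            break                                # every point is covered
        centers.append(far)
        # One pass of distance evaluations per newly selected center.
        for i, p in enumerate(points):
            d = dist(points[far], p)
            if d < gap[i]:
                gap[i] = d
                nearest[i] = far

    return centers, nearest
```

The sketch performs one pass over the data for each selected center, so its cost is governed by how many centers are needed to cover the input at the chosen radius.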