DBSCAN is a popular density-based clustering algorithm with many practical applications. However, its running time in high-dimensional space or general metric space ({\em e.g.,} clustering a set of texts by using edit distance) can be as large as quadratic in the input size. Moreover, most existing acceleration techniques for DBSCAN are available only for low-dimensional Euclidean space. In this paper, we study the DBSCAN problem under the assumption that the inliers (the core points and border points) have a low intrinsic dimension, which is a realistic assumption for many high-dimensional applications, while the outliers may lie anywhere in the space without any assumption. First, we propose a $k$-center clustering based algorithm that reduces the time-consuming labeling and merging tasks of DBSCAN to linear time. Further, we propose a linear-time approximate DBSCAN algorithm, whose key idea is to build a novel small-size summary of the core points. Our algorithm can also be implemented efficiently for streaming data, and the required memory is independent of the input size. Finally, we conduct experiments and compare our algorithms with several popular DBSCAN algorithms. The experimental results suggest that our proposed approach can significantly reduce the computational complexity in practice.
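To make the $k$-center idea concrete, here is a minimal sketch of one standard building block: the greedy farthest-point procedure (Gonzalez's heuristic), stopped once every point lies within a chosen radius of some center. It is not taken from the paper and should not be read as the authors' algorithm. The names `edit_distance` and `radius_guided_k_center` and the `radius` parameter are hypothetical, chosen only for illustration. The sketch works in any metric space, shown here with edit distance on strings.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def radius_guided_k_center(
    points: Sequence[T],
    dist: Callable[[T, T], float],
    radius: float,
) -> Tuple[List[int], List[int]]:
    """Greedy farthest-point selection (illustrative, hypothetical helper).

    Returns (centers, nearest): `centers` holds indices of the chosen
    centers, and `nearest[i]` is the index of the center closest to point i.
    Centers are added until every point is within `radius` of its center.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    if not points:
        return [], []

    centers = [0]                       # arbitrary first center
    nearest = [0] * len(points)         # closest center found so far
    gap = [dist(p, points[0]) for p in points]

    while True:
        # The point farthest from its nearest center is the next candidate.
        far = max(range(len(points)), key=gap.__getitem__)
        if gap[far] <= radius:
            break                       # every point is covered
        centers.append(far)
        # One pass over the data updates the nearest-center assignment.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < gap[i]:
                gap[i] = d
                nearest[i] = far

    return centers, nearest


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "apple", "apply", "zzzzzzzzzz"]
    centers, nearest = radius_guided_k_center(words, edit_distance, radius=2)
    for i, w in enumerate(words):
        print(w, "->", words[nearest[i]])
```

Each new center costs one pass over the data, so the number of distance evaluations is the number of points times the number of centers selected. Any two selected centers are more than `radius` apart, so the number of centers depends on how the data is spread relative to `radius`, which is why a structural assumption such as low intrinsic dimension matters for keeping it small.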