DBSCAN is one of the most important non-parametric tools for unsupervised data analysis. Applying DBSCAN to a dataset yields two key analytical results: (1) a clustering of the data points based on their density distribution and (2) the identification of outliers in the dataset. However, DBSCAN has a time complexity of $O(n^2 \beta)$, where $n$ is the number of data points and $\beta = O(D)$, with $D$ denoting the dimensionality of the data space. As a result, DBSCAN becomes computationally infeasible when both $n$ and $D$ are large. In this paper, we propose a DBSCAN method based on spectral data compression that efficiently processes datasets with a large number of data points ($n$) and high dimensionality ($D$). Because the compression process preserves only the most critical structural information, our method removes substantial redundancy and noise. Consequently, the solution quality of DBSCAN improves significantly, yielding more accurate and reliable results.
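As a rough illustration of where the $O(n^2 \beta)$ cost comes from, the following minimal sketch implements a standard brute-force DBSCAN in plain Python. It is not the compression-based method proposed here. The function name `brute_force_dbscan`, the counter `coord_ops`, and the toy data are hypothetical and chosen for illustration only; `eps` and `min_pts` stand for the usual DBSCAN neighborhood radius and density threshold. The sketch returns both analytical results mentioned above: cluster labels and outliers (label `-1`).

```python
from collections import deque


def brute_force_dbscan(points, eps, min_pts):
    """Return (labels, coord_ops); label -1 marks an outlier (noise)."""
    n = len(points)
    dim = len(points[0]) if n else 0
    eps_sq = eps * eps
    coord_ops = 0  # coordinate differences evaluated: n * n * D in total

    # Epsilon-neighborhood of every point via all-pairs squared distances.
    neighbors = []
    for i in range(n):
        row = []
        for j in range(n):
            dist_sq = 0.0
            for d in range(dim):
                diff = points[i][d] - points[j][d]
                dist_sq += diff * diff
                coord_ops += 1
            if dist_sq <= eps_sq:
                row.append(j)
        neighbors.append(row)

    labels = [None] * n  # None = not yet visited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        # Point i is a core point: grow a new cluster from it.
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels, coord_ops


# Toy data (assumed for illustration): two dense groups and one isolated point.
toy_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (20.0, 20.0),
]
toy_labels, toy_ops = brute_force_dbscan(toy_points, eps=0.5, min_pts=3)
print(toy_labels)  # [0, 0, 0, 1, 1, 1, -1]
print(toy_ops)     # 98, i.e. n * n * D = 7 * 7 * 2
```

The counter grows as $n^2 D$ regardless of how the points are distributed, which is why reducing the amount of data that DBSCAN has to touch pays off quickly when both $n$ and $D$ are large.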