Density-based clustering is a widely used tool in data science, and many data science applications now rely on high-dimensional neural embeddings. However, traditional density-based clustering techniques such as DBSCAN suffer degraded performance on high-dimensional data. In this paper, we propose LAF, a generic learned accelerator framework that speeds up both the original DBSCAN and its sampling-based variants on high-dimensional data under the angular distance metric. The framework consists of a learned cardinality estimator and a post-processing module. The cardinality estimator quickly predicts whether a data point is a core point, so that unnecessary range queries can be skipped, while the post-processing module detects false negative predictions and merges clusters that were falsely separated. Our evaluation shows that LAF-enhanced DBSCAN outperforms state-of-the-art efficient DBSCAN variants in both efficiency and clustering quality.
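The two components described in the abstract can be illustrated with a small, self-contained sketch. It is not the LAF implementation: the function names (`gated_dbscan`, `predict_core`) are hypothetical, the learned cardinality estimator is replaced by a hand-written toy predicate, range queries are brute force, and the false-negative rule used here (a skipped point is promoted to core once enough queried points have reported it as a neighbor) is only one plausible reading of the post-processing step, assumed for illustration.

```python
import math

def angular_distance(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def range_query(points, i, eps):
    """Brute-force eps-neighborhood of point i (includes i itself)."""
    return [j for j in range(len(points))
            if angular_distance(points[i], points[j]) <= eps]

def gated_dbscan(points, eps, min_pts, predict_core):
    """DBSCAN-style clustering that queries only points predicted to be core.

    predict_core(i) -> bool is a stand-in for a learned estimator (assumption).
    Returns (labels, number_of_range_queries); label -1 means noise.
    """
    n = len(points)
    neighbors = {}                        # known neighbor lists
    seen_by = [set() for _ in range(n)]   # queried points that reported j
    queries = 0

    # Pass 1: skip the range query for points predicted to be non-core.
    for i in range(n):
        if predict_core(i):
            neighbors[i] = range_query(points, i, eps)
            queries += 1
            for j in neighbors[i]:
                seen_by[j].add(i)
    core = {i for i, nb in neighbors.items() if len(nb) >= min_pts}

    # Post-processing (assumed rule): distance is symmetric, so a skipped
    # point reported by enough queried points is provably core.
    for j in range(n):
        if j not in neighbors and len(seen_by[j]) + 1 >= min_pts:
            core.add(j)
            neighbors[j] = sorted(seen_by[j] | {j})

    # Merge clusters through core-core links with a union-find structure.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    border_of = {}                        # non-core point -> one adjacent core
    for i in sorted(core):
        for j in neighbors[i]:
            if j in core:
                parent[find(i)] = find(j)
            else:
                border_of.setdefault(j, i)

    labels, ids = [], {}
    for i in range(n):
        anchor = i if i in core else border_of.get(i)
        if anchor is None:
            labels.append(-1)
        else:
            labels.append(ids.setdefault(find(anchor), len(ids)))
    return labels, queries

def unit(deg):
    """Unit vector in the plane at the given angle in degrees."""
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

# Toy data: a chain of points 3 degrees apart plus one outlier.
points = [unit(d) for d in (0, 3, 6, 9, 12, 180)]
# Toy predictor: wrongly skips index 2 (a true core point) and the outlier.
predict_core = lambda i: i not in {2, 5}
labels, queries = gated_dbscan(points, math.radians(4), 3, predict_core)
print(labels, queries)  # -> [0, 0, 0, 0, 0, -1] 4
```

In this toy run, index 2 is a false negative that bridges two groups. Because its two queried neighbors both report it, it is promoted to core and the groups are merged into one cluster, using four range queries instead of six.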