DBSCAN is a fundamental density-based clustering technique that can identify clusters of arbitrary shape. However, it becomes infeasible on big data. Centroid-based clustering, on the other hand, is important for detecting patterns in a dataset, since unprocessed data points can be labeled by assigning them to their nearest centroid; however, it cannot detect non-spherical clusters. For large data, it is not feasible to store and compute the labels of every sample. Instead, labels can be computed as and when they are required. This can be accomplished when clustering acts as a tool to identify cluster representatives, and a query is served by assigning it the cluster label of its nearest representative. In this paper, we propose an Incremental Prototype-based DBSCAN (IPD) algorithm, which is designed to identify arbitrary-shaped clusters in large-scale data. Additionally, it chooses a set of representatives for each cluster.
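
The query step described above, labeling a point by its nearest cluster representative, can be illustrated with a minimal sketch. This is not the IPD algorithm itself: it assumes the representatives and their cluster labels have already been chosen by some clustering procedure. The function name `nearest_representative_label` and all coordinates below are hypothetical, chosen only for illustration.

```python
import math

def nearest_representative_label(query, representatives, rep_labels):
    """Return the cluster label of the representative closest to `query`.

    Assumption: Euclidean distance and a brute-force scan over all
    representatives; a spatial index could replace the scan for large sets.
    """
    best = min(
        range(len(representatives)),
        key=lambda i: math.dist(query, representatives[i]),
    )
    return rep_labels[best]

# Hypothetical representatives: cluster 0 is an elongated strip covered by
# three representatives; cluster 1 is a compact group covered by one.
representatives = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
rep_labels = [0, 0, 0, 1]

# Labels are computed only when a query arrives, not stored per sample.
queries = [(3.9, 0.2), (2.1, 2.5), (0.3, -0.1)]
labels = [
    nearest_representative_label(q, representatives, rep_labels)
    for q in queries
]
print(labels)  # [0, 1, 0]
```

Because a cluster may own several representatives, an elongated cluster such as the strip above can be covered along its length, which a single centroid could not do.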