Density-based clustering aims to find groups of similar objects (i.e., clusters) in a given dataset. Applications include process mining and anomaly detection. The clustering result is determined by two user parameters (ε, MinPts), which are typically unknown in advance. Users therefore need to test various settings interactively until they find satisfactory clusterings. However, existing solutions suffer from the following limitations: (a) ineffective pruning of expensive neighborhood computations; (b) approximate clustering, where objects are falsely labeled as noise; (c) restricted parameter tuning that is limited to ε while MinPts is held constant, which reduces the set of explorable clusterings; and (d) inflexibility in terms of applicable data types and distance functions. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexity of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient and flexible regarding data types and distance functions. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments on 12 large real-world datasets from various domains, FINEX frequently outperforms state-of-the-art techniques for exact clustering by orders of magnitude.
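To make the role of the two parameters concrete, the following is a minimal sketch of textbook DBSCAN, assumed here to be the "original notion" of density-based clustering that the abstract refers to. It is not FINEX and uses no index: every neighborhood is computed by brute force, which is exactly the cost that an index would try to avoid. The function names `region_query` and `dbscan` and the toy points are illustrative assumptions, not taken from the paper.

```python
from math import dist

NOISE = -1  # label for objects that belong to no cluster


def region_query(points, i, eps):
    """Brute-force eps-neighborhood of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one cluster label (or NOISE) per point."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Toy data: a group of four points, a group of three, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]

print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, -1]
print(dbscan(points, eps=1.5, min_pts=4))  # [0, 0, 0, 0, -1, -1, -1, -1]
```

With ε fixed at 1.5, raising MinPts from 3 to 4 turns the three-point group into noise. This shows why exploring only ε with a constant MinPts, as in limitation (c), leaves some clusterings unreachable.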