We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset's mutual reachability minimum spanning tree. Using mutual reachability distances smooths the data distribution and reduces the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source 'lumbermark' package for Python and R. We show that Lumbermark performs well on benchmark data and hope that it will prove useful to data scientists and practitioners across different fields.
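To make the idea of a mutual reachability minimum spanning tree concrete, here is a minimal pure-Python sketch. It is not the 'lumbermark' package API and not the authors' exact procedure. It assumes the HDBSCAN-style definition of mutual reachability, max(d(i, j), core(i), core(j)), where core(i) is the distance from point i to its M-th nearest neighbour. The function names, the parameters `M`, `k`, and `min_size`, and the cutting rule (drop the heaviest tree edge whose removal leaves at least `min_size` points on each side) are all illustrative assumptions standing in for the abstract's "chopping off large limbs".

```python
import math


def mutual_reachability_mst(points, M=2):
    """Return MST edges (weight, i, j) under mutual reachability distance.

    Assumes the HDBSCAN-style definition; requires M < len(points).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Core distance: distance to the M-th nearest neighbour.
    # sorted(row)[0] is the point's zero distance to itself.
    core = [sorted(row)[M] for row in dist]

    def d_mr(i, j):
        return max(dist[i][j], core[i], core[j])

    # Prim's algorithm on the complete mutual reachability graph.
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                w = d_mr(u, v)
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges


def components(n, edges):
    """Label the connected components of a forest given as (w, i, j) edges."""
    adj = [[] for _ in range(n)]
    for _, i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


def split_tree(n, edges, k, min_size):
    """Illustrative divisive step (an assumption, not the published rule):
    repeatedly remove the heaviest edge whose removal leaves at least
    `min_size` points on both sides, until k components exist."""
    kept = sorted(edges)  # ascending by weight
    for _ in range(k - 1):
        for idx in range(len(kept) - 1, -1, -1):
            trial = kept[:idx] + kept[idx + 1:]
            labels = components(n, trial)
            _, i, j = kept[idx]
            if min(labels.count(labels[i]), labels.count(labels[j])) >= min_size:
                kept = trial
                break
        else:
            break  # no admissible cut remains
    return components(n, kept)
```

Calling `split_tree(len(points), mutual_reachability_mst(points, M=2), k=2, min_size=3)` on two compact groups with a stray point between them returns one label per point. The `min_size` constraint prevents the stray point from being split off as a singleton cluster, which illustrates in a simplified way why cutting only "large limbs" keeps noise points from forming their own clusters.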