Cluster separation is typically tackled with widely used clustering techniques such as k-means or DBSCAN. However, these algorithms are based on non-perceptual metrics, and our experiments demonstrate that their output does not reflect human cluster perception. To bridge the gap between human cluster perception and machine-computed clusters, we propose HPSCAN, a learning strategy that operates directly on scattered data. To learn perceptual cluster separation on such data, we crowdsourced the labeling of 7,320 bivariate (scatterplot) datasets to 384 human participants, and we train our HPSCAN model on the resulting human-annotated data. Instead of rendering these data as scatterplot images, we use the x and y coordinates of the points as input to a modified PointNet++ architecture, which enables direct inference on point clouds. In this work, we describe how we collected our dataset, report statistics of the resulting annotations, and investigate the perceptual agreement of cluster separation for real-world data. We also report the training and evaluation protocol for HPSCAN and introduce a novel metric that measures how accurately a clustering technique matches a group of human annotators. In addition, we explore predicting point-wise human agreement to detect ambiguities. Finally, we compare our approach to ten established clustering techniques and demonstrate that HPSCAN is capable of generalizing to unseen and out-of-scope data.
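To illustrate what scoring a clustering against a group of annotators can look like, the following minimal sketch averages the plain Rand index over several human labelings of the same points. It is not the metric introduced in this work, whose definition is not given here; the Rand index is swapped in as a simple, well-known stand-in, and the function names `rand_index` and `mean_agreement_with_annotators` are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs that both labelings treat the same way,
    i.e. both put the pair in one cluster or both split it.
    The score is invariant to how cluster ids are named."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two points: nothing to disagree on
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_agreement_with_annotators(predicted, annotations):
    """Average Rand index between one predicted labeling and each
    annotator's labeling (illustrative stand-in, not the paper's metric)."""
    if not annotations:
        raise ValueError("need at least one annotator")
    return sum(rand_index(predicted, a) for a in annotations) / len(annotations)


if __name__ == "__main__":
    predicted = [0, 0, 1, 1]          # output of some clustering technique
    annotations = [
        [1, 1, 0, 0],                 # same partition, different cluster ids
        [0, 0, 0, 1],                 # annotator who merged three points
    ]
    print(mean_agreement_with_annotators(predicted, annotations))
```

Pair counting is used here because cluster identifiers chosen by different annotators are arbitrary, so a direct label-by-label comparison would penalize partitions that are in fact identical.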