Cluster separation in scatterplots is typically tackled with widely used clustering techniques such as k-means or DBSCAN. However, because these algorithms are based on non-perceptual metrics, their output often does not reflect human cluster perception. To bridge the gap between human cluster perception and machine-computed clusters, we propose a learning strategy that operates directly on scattered data. To learn perceptual cluster separation on this data, we crowdsourced a large-scale dataset consisting of 7,320 point-wise cluster affiliations for bivariate data, labeled by 384 human crowd workers. Based on this data, we trained ClusterNet, a point-based deep learning model, to reflect human perception of cluster separability. To train ClusterNet on human-annotated data, we do not render scatterplots on a 2D canvas; instead, we use a PointNet++ architecture, which enables inference directly on point clouds. In this work, we provide details on how we collected our dataset, report statistics of the resulting annotations, and investigate perceptual agreement of cluster separation for real-world data. We further report the training and evaluation protocol of ClusterNet and introduce a novel metric that measures the accuracy between a clustering technique and a group of human annotators. Finally, we compare our approach against existing state-of-the-art clustering techniques.
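To make the idea of scoring a clustering technique against a group of human annotators concrete, the following is a minimal sketch. It is not the metric introduced in this work, which the abstract does not define. As a stand-in, it assumes a plain Rand index (the fraction of point pairs on which two labelings agree) averaged over annotators. The function names `rand_index` and `mean_agreement` and the toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs that both labelings treat the same way,
    i.e. both put the pair in one cluster or both split it."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one point: nothing to disagree on
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_agreement(predicted, annotations):
    """Average Rand index between one predicted labeling and each
    annotator's labeling (assumed stand-in, not the paper's metric)."""
    if not annotations:
        raise ValueError("need at least one annotator")
    return sum(rand_index(predicted, a) for a in annotations) / len(annotations)


# Hypothetical toy data: six points labeled by three annotators.
annotators = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],  # this annotator assigns the third point differently
]
# Output of some clustering technique; label ids need not match the annotators'.
predicted = [5, 5, 5, 7, 7, 7]

score = mean_agreement(predicted, annotators)
print(f"mean agreement with annotators: {score:.3f}")
```

Because the comparison is made on pairs of points, the score does not depend on how cluster ids are numbered, so labelings from different annotators and from an algorithm can be compared without first matching their labels.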