ClusterNet: A Perception-Based Clustering Model for Scattered Data

Visualizations of scattered data help users understand attributes of their data through tasks such as correlation estimation, outlier detection, and cluster separation. In this paper, we focus on the last of these tasks and develop a technique aligned with human perception, which can be used to understand how human subjects perceive clusterings in scattered data and possibly to optimize for better understanding. Cluster separation in scatterplots is typically tackled with widely used clustering techniques such as k-means or DBSCAN. However, because these algorithms are based on non-perceptual metrics, our experiments show that their output does not reflect human cluster perception. We propose a learning strategy that operates directly on scattered data. To learn perceptual cluster separation on this data, we crowdsourced a large-scale dataset consisting of 7,320 point-wise cluster affiliations for bivariate data, labeled by 384 human crowd workers. Based on this data, we trained ClusterNet, a point-based deep learning model that reflects human perception of cluster separability. To train ClusterNet on human-annotated data, we use a PointNet++ architecture, which enables inference directly on point clouds. In this work, we provide details on how we collected our dataset, report statistics of the resulting annotations, and investigate perceptual agreement of cluster separation for real-world data. We further report the training and evaluation protocol of ClusterNet and introduce a novel metric that measures the accuracy of a clustering technique with respect to a group of human annotators. Finally, we compare our approach against existing state-of-the-art clustering techniques and show that ClusterNet is able to generalize to unseen and out-of-scope data.
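To make the idea of scoring a clustering against a group of annotators concrete, the following is a minimal sketch that averages the plain Rand index between one predicted labeling and each annotator's point-wise labeling. This is an illustrative stand-in and not the metric introduced in the paper, whose definition is not given in this abstract. The function names `rand_index` and `mean_agreement` and the toy labelings are hypothetical.

```python
from itertools import combinations
from statistics import mean


def rand_index(labels_a, labels_b):
    """Fraction of point pairs that both labelings treat the same way
    (both put the pair in one cluster, or both split it)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_agreement(predicted, annotator_labelings):
    """Average Rand index between one clustering and each annotator's labeling.
    Assumption: a simple stand-in, not the paper's metric."""
    return mean(rand_index(predicted, labels) for labels in annotator_labelings)


# Hypothetical toy data: six points, three annotators, two candidate clusterings.
annotators = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]
candidate_a = [0, 0, 0, 1, 1, 1]  # matches the majority of annotators
candidate_b = [0, 1, 0, 1, 0, 1]  # alternating labels, unlike any annotator

print(mean_agreement(candidate_a, annotators))
print(mean_agreement(candidate_b, annotators))
```

Because the Rand index compares pairs of points rather than label values, it is unaffected by how each annotator happens to number the clusters. In this toy setup, `candidate_a` scores higher than `candidate_b`.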
