Visual clustering is a common perceptual task in scatterplots that supports diverse analytics tasks (e.g., cluster identification). However, even for the same scatterplot, people can perceive clusters (i.e., conduct visual clustering) differently because of individual differences and ambiguous cluster boundaries. Although such perceptual variability casts doubt on the reliability of data analysis based on visual clustering, we lack a systematic way to assess this variability efficiently. In this research, we study perceptual variability in conducting visual clustering, which we call Cluster Ambiguity. To this end, we introduce CLAMS, a data-driven visual quality measure for automatically predicting cluster ambiguity in monochrome scatterplots. We first conduct a qualitative study to identify key factors that affect the visual separation of clusters (e.g., proximity or size difference between clusters). Based on the study findings, we deploy a regression module that estimates the human-judged separability of two clusters. CLAMS then predicts cluster ambiguity by aggregating the module's separability estimates over all pairs of clusters. CLAMS outperforms widely used clustering techniques in predicting ground-truth cluster ambiguity and performs on par with human annotators. We conclude by presenting two applications of CLAMS: optimizing and benchmarking data mining techniques. The interactive demo of CLAMS is available at clusterambiguity.dev.
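The two-stage idea described above (estimate the separability of each pair of clusters, then aggregate the pairwise estimates into one ambiguity score) can be illustrated with a minimal sketch. This is not the CLAMS implementation. The abstract does not specify the aggregation, so the binary-entropy conversion and the plain mean used here are assumptions chosen for illustration, and `toy_separability` is a hypothetical stand-in for the learned regression module. All function names are likewise hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Cluster = Sequence[Point]


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a separability score p in [0, 1].

    Assumption for illustration: a pair is most ambiguous when p = 0.5
    and unambiguous when p is 0 or 1.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def cluster_ambiguity(
    clusters: List[Cluster],
    separability: Callable[[Cluster, Cluster], float],
) -> float:
    """Aggregate pairwise separability into one score in [0, 1].

    Assumption: the aggregate is the unweighted mean of the per-pair
    entropies; the source does not state the actual aggregation.
    """
    pairs = list(combinations(clusters, 2))
    if not pairs:
        return 0.0  # fewer than two clusters: nothing to confuse
    return sum(binary_entropy(separability(a, b)) for a, b in pairs) / len(pairs)


def toy_separability(a: Cluster, b: Cluster) -> float:
    """Hypothetical placeholder for the learned regression module.

    Scores a pair by centroid distance relative to the clusters' spread.
    """
    def centroid(c: Cluster) -> Point:
        return (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))

    def spread(c: Cluster, center: Point) -> float:
        return sum(math.dist(p, center) for p in c) / len(c)

    ca, cb = centroid(a), centroid(b)
    scale = spread(a, ca) + spread(b, cb) + 1e-9  # avoid division by zero
    return 1.0 - math.exp(-math.dist(ca, cb) / scale)


if __name__ == "__main__":
    left = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    near = [(1.5, 0.5), (2.5, 0.5), (1.5, 1.5), (2.5, 1.5)]
    far = [(20.0, 20.0), (21.0, 20.0), (20.0, 21.0), (21.0, 21.0)]
    print(cluster_ambiguity([left, near, far], toy_separability))
```

Passing the separability estimator in as a function keeps the aggregation step independent of how the pairwise scores are produced, so a trained regressor could replace the toy heuristic without changing `cluster_ambiguity`.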