Broadly, the goal of clustering is to separate observations into meaningful subgroups. The rich variety of clustering methods reflects the fact that the relevant notion of a meaningful cluster varies across applications. The classical Bayesian approach clusters observations by their association with the components of a mixture model; the choice of component class allows flexibility to capture a range of meaningful cluster notions. In practice, however, this range is somewhat limited, because difficulties with computation and cluster identifiability arise as the components are made more flexible. Instead of mixture component attribution, we consider clusterings that are functions of the data and the density $f$, which allows us to separate flexible density estimation from clustering. Within this framework, we develop a method to cluster data into connected components of a level set of $f$. Under mild conditions, we establish that our Bayesian level-set (BALLET) clustering methodology yields consistent estimates, and we highlight its performance in a variety of toy and simulated data examples. Finally, through an application to astronomical data, we show that the method performs favorably relative to the popular level-set clustering algorithm DBSCAN in terms of accuracy, insensitivity to tuning parameters, and quantification of uncertainty.
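To make the idea of clustering by connected components of a level set of $f$ concrete, the following is a minimal sketch and not the BALLET procedure itself. It assumes a plug-in kernel density estimate in place of a posterior distribution over $f$, and it approximates connectivity by linking retained points that lie within a fixed radius of each other. The names `kde`, `level_set_clusters`, `bandwidth`, `level`, and `delta` are hypothetical and chosen only for illustration.

```python
import math
import random


def kde(points, bandwidth):
    """Return an isotropic Gaussian kernel density estimate as a callable.

    Assumption: a simple plug-in estimate stands in for the density f.
    """
    n, d = len(points), len(points[0])
    norm = n * (2.0 * math.pi * bandwidth**2) ** (d / 2.0)

    def f_hat(x):
        total = 0.0
        for p in points:
            total += math.exp(-math.dist(x, p) ** 2 / (2.0 * bandwidth**2))
        return total / norm

    return f_hat


def level_set_clusters(points, density, level, delta):
    """Label points by connected component of the estimated level set.

    Points with density below `level` are labelled -1 (noise). The remaining
    points are linked whenever they lie within distance `delta`, and each
    connected component of that graph receives one integer label.
    """
    active = [i for i, x in enumerate(points) if density(x) >= level]
    parent = {i: i for i in active}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, i in enumerate(active):
        for j in active[a + 1:]:
            if math.dist(points[i], points[j]) <= delta:
                parent[find(i)] = find(j)

    labels = [-1] * len(points)
    ids = {}
    for i in active:
        labels[i] = ids.setdefault(find(i), len(ids))
    return labels


# Toy data: two well-separated Gaussian blobs in the plane.
rng = random.Random(0)
blob_a = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(60)]
blob_b = [(rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)) for _ in range(60)]
points = blob_a + blob_b

f_hat = kde(points, bandwidth=0.5)
values = sorted(f_hat(x) for x in points)
level = values[len(values) // 10]  # keep roughly the densest 90% of points
labels = level_set_clusters(points, f_hat, level=level, delta=1.0)
```

The density estimate is passed in as an argument, so the clustering step does not depend on how $f$ was estimated; this mirrors the separation of density estimation from clustering described above. The pairwise linking step is quadratic in the number of retained points, which is acceptable for a toy example but would call for a spatial index on larger data.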