A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem: the object to be estimated is a function that maps a point to its distribution of cluster membership. Unlike existing approaches that implicitly estimate such a function, such as Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and instead exploits the flexibility of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to determine automatically both an appropriate level of flexibility and the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets document the strong performance of the proposed approach in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/CNS.
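To make the estimated object concrete, the following is a minimal sketch of a function that maps a point to a distribution over clusters, built by kernel (Nadaraya-Watson) smoothing of given membership vectors. It is an illustration only and is not the method implemented in the CNS repository: the Gaussian kernel, the fixed `bandwidth`, the one-dimensional toy data, and the names `gaussian_kernel` and `smoothed_membership` are all assumptions made for this example, and the memberships are supplied by hand rather than estimated.

```python
import math


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)


def smoothed_membership(x, points, memberships, bandwidth):
    """Kernel-weighted average of membership vectors, evaluated at x.

    Each row of `memberships` is a probability vector over clusters, so the
    weighted average is again a probability vector.
    """
    weights = [gaussian_kernel((x - p) / bandwidth) for p in points]
    total = sum(weights)
    n_clusters = len(memberships[0])
    return [
        sum(w * m[j] for w, m in zip(weights, memberships)) / total
        for j in range(n_clusters)
    ]


# Hypothetical toy data: two well-separated groups on the real line,
# with hand-assigned one-hot memberships (assumed, not estimated).
points = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4]
memberships = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3

near_first = smoothed_membership(0.1, points, memberships, bandwidth=0.5)
midpoint = smoothed_membership(1.7, points, memberships, bandwidth=0.5)
print(near_first, midpoint)
```

A point inside the first group receives almost all of its mass on the first cluster, whereas the midpoint between the two groups receives a roughly even split. The bandwidth plays the role of a flexibility parameter here: smaller values give sharper, more local membership estimates.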