Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd's algorithm, attain this optimal error rate. In this paper, we first establish a universal lower bound for the error rate in clustering any mixture model, expressed through a Chernoff divergence, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, with particular emphasis on location-scale mixtures with Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. For such mixtures, we establish that Bregman hard clustering, a variant of Lloyd's algorithm that employs a Bregman divergence, is rate optimal.
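To make the last claim concrete, here is a minimal sketch of Bregman hard clustering for the Poisson case. It assumes the standard formulation in which the assignment step of Lloyd's algorithm uses the Bregman divergence generated by φ(x) = x log x − x, that is, d(x, μ) = x log(x/μ) − x + μ summed over coordinates, while the update step remains the arithmetic mean. The function names, the initial centers, and the simulated data are hypothetical illustrations and are not taken from the paper.

```python
import numpy as np


def poisson_bregman(X, centers):
    """Pairwise divergence d(x, mu) = x*log(x/mu) - x + mu, summed over coordinates.

    X has shape (n, d), centers has shape (k, d); the result has shape (n, k).
    """
    x = X[:, None, :]          # (n, 1, d)
    mu = centers[None, :, :]   # (1, k, d)
    # Use the convention 0 * log(0) = 0 for zero counts.
    with np.errstate(divide="ignore", invalid="ignore"):
        xlog = np.where(x > 0, x * np.log(x / mu), 0.0)
    return (xlog - x + mu).sum(axis=2)


def bregman_hard_clustering(X, init_centers, n_iter=20, eps=1e-9):
    """Lloyd-style iterations with a Bregman assignment step (assumes n_iter >= 1)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: nearest center in Bregman divergence.
        # Centers are clipped away from zero so the logarithm stays finite.
        D = poisson_bregman(X, np.maximum(centers, eps))
        labels = D.argmin(axis=1)
        # Update step: the arithmetic mean of each non-empty cluster.
        for k in range(centers.shape[0]):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels, centers


# Hypothetical usage: two Poisson clusters in 5 dimensions with rates 2 and 20.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.poisson(2.0, size=(200, 5)),
    rng.poisson(20.0, size=(200, 5)),
])
demo_labels, demo_centers = bregman_hard_clustering(
    X_demo, init_centers=[[5.0] * 5, [15.0] * 5]
)
```

Only the assignment step differs from ordinary Lloyd's algorithm. The mean update is unchanged because, for any Bregman divergence taken in its second argument, the arithmetic mean minimises the total divergence of a cluster to its center.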