Clustering, or unsupervised classification, is a task often plagued by outliers, yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification, with the first two often requiring pre-specification of the number of outliers. The fact that the sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points, which are deemed outliers, according to the subset log-likelihoods, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
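The beta-distribution fact that the derivation rests on can be checked numerically. The sketch below is not the OCLUST algorithm and is not taken from the authors' code; it only illustrates the standard single-Gaussian result, assuming n observations in p dimensions, the sample mean, and the unbiased sample covariance, in which case n/(n-1)^2 times each squared Mahalanobis distance follows a Beta(p/2, (n-p-1)/2) distribution. The function name `scaled_sq_mahalanobis` and the simulation settings are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats


def scaled_sq_mahalanobis(X):
    """Return n/(n-1)^2 * D_i^2 for each row of X.

    D_i^2 is the squared Mahalanobis distance of row i from the sample mean,
    computed with the unbiased sample covariance (divisor n - 1).
    """
    n, _ = X.shape
    centered = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Solve S z = centered^T instead of forming an explicit inverse.
    d2 = np.einsum("ij,ji->i", centered, np.linalg.solve(S, centered.T))
    return n / (n - 1) ** 2 * d2


# Illustrative settings (assumptions, not values from the paper).
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(mean=np.zeros(p), cov=np.eye(p), size=n)

scaled = scaled_sq_mahalanobis(X)
reference = stats.beta(a=p / 2, b=(n - p - 1) / 2)

# The distances share one sample mean and covariance, so they are only
# approximately independent; the KS test is an informal check here.
ks = stats.kstest(scaled, reference.cdf)
print(f"sample mean = {scaled.mean():.6f}, beta mean = {reference.mean():.6f}")
print(f"KS statistic = {ks.statistic:.4f}")
```

The two means agree exactly up to floating-point error, because the squared distances always sum to (n-1)p. For clean Gaussian data the KS statistic should be small. Contaminating `X` with a few distant points is one way to see the empirical distribution drift away from the reference, which is the kind of departure a trimming procedure would look for.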