Unsupervised classification, or clustering, is often plagued by outliers, yet little work addresses how to handle outliers in this setting. Outlier algorithms tend to fall into two broad categories: outlier inclusion methods and trimming methods, which often require the number of points to remove to be specified in advance. The fact that the sample Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is proposed that removes the least likely points, which are deemed outliers, until the log-likelihoods adhere to the reference distribution. The result is a trimming method that inherently estimates the number of outliers present.
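The trimming idea can be illustrated with a much simplified sketch. It is not the proposed algorithm: it assumes a single bivariate Gaussian component rather than a mixture, works with Mahalanobis distances directly instead of subset log-likelihoods, and uses a Kolmogorov-Smirnov distance with the asymptotic 5% critical value 1.36/sqrt(n) as a stand-in stopping rule. It relies on the standard result that, for n points from a p-variate Gaussian with sample mean and unbiased sample covariance, n/(n-1)^2 times the squared Mahalanobis distance follows Beta(p/2, (n-p-1)/2); for p = 2 this is Beta(1, (n-3)/2), whose CDF is 1 - (1-x)^((n-3)/2). The function names, the cap `max_remove`, and the planted test points are all hypothetical.

```python
import math
import random


def scaled_mahalanobis_2d(points):
    """Return n/(n-1)^2 * D_i^2 for each 2-D point, using the sample mean
    and the unbiased sample covariance of `points`."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    scale = n / (n - 1) ** 2
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d^T S^{-1} d with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        out.append(scale * d2)
    return out


def ks_statistic(u, n):
    """KS distance between the empirical CDF of `u` and Beta(1, (n-3)/2)."""
    b = (n - 3) / 2
    m = len(u)
    worst = 0.0
    for i, x in enumerate(sorted(u)):
        f = 1.0 - (1.0 - min(x, 1.0)) ** b  # closed-form Beta(1, b) CDF
        worst = max(worst, abs(f - i / m), abs((i + 1) / m - f))
    return worst


def trim_outliers(points, max_remove=20):
    """Remove the least likely point (largest Mahalanobis distance) until the
    scaled distances look beta-distributed, or `max_remove` is reached."""
    kept = list(points)
    removed = []
    while len(removed) < max_remove:
        n = len(kept)
        u = scaled_mahalanobis_2d(kept)
        # Heuristic stopping rule: the distances are not independent, so the
        # asymptotic KS critical value is only an approximation here.
        if ks_statistic(u, n) <= 1.36 / math.sqrt(n):
            break
        worst = max(range(n), key=u.__getitem__)
        removed.append(kept.pop(worst))
    return kept, removed


# Hypothetical demo data: 200 standard-normal points plus 3 planted outliers.
random.seed(0)
clean = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
planted = [(50.0, 50.0), (55.0, 45.0), (45.0, 55.0)]
kept, removed = trim_outliers(clean + planted)
print(len(removed), "points trimmed")
```

The number of trimmed points is not supplied by the user; it falls out of the stopping rule, which is the property the abstract highlights. A faithful implementation would refit a Gaussian mixture model on each subset and compare subset log-likelihoods with the derived reference distribution.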