Clustering, or unsupervised classification, is a task often plagued by outliers, yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and \textit{post hoc} outlier identification methods, with the first two often requiring pre-specification of the number of outliers. Using the fact that the sample Mahalanobis distance is beta-distributed, an approximate distribution is derived for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. The result is a trimming method, called OCLUST, that inherently estimates the number of outliers.
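The beta-distribution property underpinning the derivation can be checked empirically. The classical result (for a point included in its own sample estimates) is that $n\,d_i^2/(n-1)^2 \sim \mathrm{Beta}\!\left(\tfrac{p}{2}, \tfrac{n-p-1}{2}\right)$, where $d_i^2$ is the squared sample Mahalanobis distance. The sketch below simulates Gaussian data and compares the transformed distances to that beta law; it is an illustration of the distributional fact only, not the OCLUST algorithm, and the sample size and dimension chosen are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 500, 3  # arbitrary sample size and dimension for illustration
X = rng.standard_normal((n, p))

# Squared sample Mahalanobis distances, with each point included
# in the mean and (unbiased, n-1 denominator) covariance estimates.
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Classical result: n * d2 / (n-1)^2 ~ Beta(p/2, (n-p-1)/2).
u = n * d2 / (n - 1) ** 2
ks = stats.kstest(u, stats.beta(p / 2, (n - p - 1) / 2).cdf)
print(ks.statistic)  # small KS statistic: the beta law fits
```

A small Kolmogorov–Smirnov statistic here indicates agreement with the reference beta distribution; adding gross outliers to `X` inflates the largest distances and breaks that agreement, which is the kind of departure the trimming procedure exploits.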