Outlier detection based on rate-distortion theory builds on the rationale that a good data compression scheme will encode outliers with unique symbols. Based on this rationale, we propose Cluster Purging, an extension of clustering-based outlier detection. This extension allows one to assess the representivity of clusterings and to find data points that are best represented by individual, unique clusters. We propose two efficient algorithms for performing Cluster Purging: one is parameter-free, while the other has a parameter that controls the representivity estimates, allowing it to be tuned in supervised settings. In an experimental evaluation, we show that Cluster Purging improves upon the outliers detected from raw clusterings and that it competes strongly against state-of-the-art alternatives.