Clustering and outlier detection are two important tasks in data mining. Outliers frequently distort the similarity that clustering algorithms measure between objects, which makes the clustering results unreliable. Currently, only a few clustering algorithms (e.g., DBSCAN) can detect outliers and thereby eliminate this interference. For other clustering algorithms, running a separate outlier detection task to remove outliers before each clustering process is tedious. Equipping more clustering algorithms with outlier detection ability is therefore highly valuable. A common strategy lets clustering algorithms detect outliers based on the distance between objects and clusters, but this strategy conflicts with improving the performance of clustering algorithms on datasets with outliers. In this paper, we propose a novel outlier detection approach for clustering, called ODAR. ODAR maps outliers and normal objects into two separated clusters by feature transformation. As a result, any clustering algorithm can detect outliers by identifying clusters. Experiments show that ODAR is robust across diverse datasets. Compared with baseline methods, clustering algorithms assisted by ODAR achieve the best results on 7 out of 10 datasets, with at least a 5% improvement in accuracy.
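To make the "common strategy" mentioned in the abstract concrete, the following is a minimal sketch of distance-based outlier flagging on top of an existing clustering. It is not ODAR, whose feature transformation the abstract does not specify. The function name `flag_outliers_by_distance`, the fixed `threshold`, the toy data, and the hand-picked centroids are all assumptions made for illustration.

```python
import numpy as np


def flag_outliers_by_distance(X, centroids, threshold):
    """Assign each object to its nearest centroid and flag distant objects.

    Hypothetical helper: `threshold` is an assumed, user-chosen cutoff on the
    Euclidean distance between an object and its nearest cluster centroid.
    """
    # Pairwise distances between objects and centroids, shape (n_objects, n_clusters).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)   # index of the nearest cluster
    nearest = dists.min(axis=1)     # distance to that cluster
    return labels, nearest > threshold


# Toy data (assumed): two tight groups plus one far-away object.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, -10.0],
])
# Centroids are given by hand here; in practice a clustering algorithm
# would estimate them from data that still contains the outliers.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

labels, is_outlier = flag_outliers_by_distance(X, centroids, threshold=1.0)
print(labels, is_outlier)
```

In this sketch the centroids are clean because they were chosen by hand. When centroids are instead estimated from data that includes outliers, the outliers pull the centroids toward themselves, so the distances used for detection are already distorted; this is one plausible reading of the conflict the abstract describes.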