We hypothesize that similar objects should have similar outlier scores. To our knowledge, all existing outlier detectors calculate the outlier score of each object independently, without regard to the outlier scores of other objects. Therefore, they do not guarantee that similar objects receive similar outlier scores. To test this hypothesis, we propose an outlier score post-processing technique for outlier detectors, called neighborhood averaging (NA), which considers each object together with its neighbors and guarantees that their outlier scores become more similar than the original scores. Given an object and its outlier score from any outlier detector, NA modifies the score by combining it with the scores of the object's k nearest neighbors. We demonstrate the effectiveness of NA using the well-known k-nearest neighbors (k-NN). Experimental results show that NA improves all 10 tested baseline detectors, by 13% on average (from 0.70 to 0.79 AUC) across nine real-world datasets. Moreover, even outlier detectors that are already based on k-NN are improved. The experiments also show that in some applications the choice of detector is no longer significant when detectors are used jointly with NA, which may challenge the widely held view that the data model is the most important factor. We release our code at www.outlierNet.com for reproducibility.
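The post-processing step described in the abstract can be illustrated with a minimal sketch. The abstract only says that NA "combines" a score with the scores of the k nearest neighbors, so the unweighted mean over the object and its neighbors, the Euclidean distance, and the names `neighborhood_average`, `points`, and `raw_scores` are all assumptions made here for illustration; this is not the authors' released implementation.

```python
import math


def neighborhood_average(points, scores, k):
    """Smooth outlier scores over each object's k nearest neighbors.

    Assumptions (not stated in the abstract): neighbors are found with
    Euclidean distance, and "combining" means the unweighted mean of the
    object's own score and its k neighbors' scores.
    """
    if len(points) != len(scores):
        raise ValueError("points and scores must have the same length")
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(points)
    k = min(k, n - 1)  # an object cannot have more than n - 1 neighbors
    smoothed = []
    for i in range(n):
        # Indices of all other objects, ordered by distance to object i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighborhood = [i] + others[:k]
        smoothed.append(sum(scores[j] for j in neighborhood) / len(neighborhood))
    return smoothed


# Toy data: two well-separated groups of three objects each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]
# Hypothetical raw scores, as if produced by some baseline detector.
raw_scores = [0.1, 0.9, 0.2, 0.8, 0.7, 0.6]

smoothed = neighborhood_average(points, raw_scores, k=2)
print(smoothed)
```

With k=2, each object in this toy example is averaged with the two other members of its own group, so nearby objects end up with nearly identical scores. The brute-force neighbor search is quadratic in the number of objects; a spatial index would be the natural replacement for larger data.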