Scores produced by distance-based outlier detection methods are difficult to interpret, so choosing a cut-off threshold between normal and outlier data points is challenging without additional context. We describe a generic transformation of distance-based outlier scores into interpretable, probabilistic estimates. The transformation is ranking-stable and increases the contrast between normal and outlier data points. Identifying nearest-neighbor relationships requires computing distances between data points, yet most of the computed distances are typically discarded. We show that the distances to other data points can be used to model distance probability distributions, and that these distributions can in turn be used to convert distance-based outlier scores into outlier probabilities. Our experiments on numerous tabular and image benchmark datasets show that the probabilistic transformation does not affect detection performance while producing interpretable outlier scores with increased contrast between normal and outlier samples. Our work generalizes to a wide range of distance-based outlier detection methods and, because it reuses existing distance computations, adds no significant computational overhead.
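The general idea can be illustrated with a minimal sketch. It is not the transformation proposed in this work, whose distributional model is not specified in the abstract. As a stand-in, it assumes the k-th nearest-neighbor distance as the outlier score and the empirical CDF of all pooled pairwise distances as the distance distribution. The function name `knn_scores_and_probs`, the parameter `k`, and the synthetic data are hypothetical choices made for illustration only.

```python
import numpy as np


def knn_scores_and_probs(X, k=5):
    """Illustrative only: k-NN distance scores mapped through an empirical CDF."""
    n = len(X)
    # Full pairwise Euclidean distance matrix; a k-NN search needs these anyway.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep distances to *other* points only (drop the zero self-distances).
    off_diag = D[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    # Classic distance-based outlier score: distance to the k-th nearest neighbor.
    scores = np.sort(off_diag, axis=1)[:, k - 1]
    # Assumption: the pooled pairwise distances serve as the reference
    # distribution, instead of being discarded after the neighbor search.
    pooled = np.sort(off_diag.ravel())
    # Empirical CDF evaluated at each score: fraction of pooled distances <= score.
    probs = np.searchsorted(pooled, scores, side="right") / pooled.size
    return scores, probs


# Usage on synthetic data: 100 Gaussian inliers plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])
scores, probs = knn_scores_and_probs(X, k=5)
```

Because a CDF is monotonically non-decreasing, the mapped values never reverse the order of the raw scores, and they lie in [0, 1]. The only added cost in this sketch is one sort of distances that were already computed.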