The scores of distance-based outlier detection methods are difficult to interpret, which makes it hard to choose a cut-off threshold between normal and outlier data points without additional context. We describe a generic transformation of distance-based outlier scores into interpretable, probabilistic estimates. The transformation is ranking-stable and increases the contrast between normal and outlier data points. Determining distance relationships between data points is necessary to identify the nearest-neighbor relationships in the data, yet most of the computed distances are typically discarded. We show that the distances to other data points can be used to model distance probability distributions, and that these distributions can subsequently be used to turn distance-based outlier scores into outlier probabilities. Our experiments show that the probabilistic transformation does not affect detection performance across numerous tabular and image benchmark datasets, but results in interpretable outlier scores with increased contrast between normal and outlier samples. Our work generalizes to a wide range of distance-based outlier detection methods, and because it reuses existing distance computations, it adds no significant computational overhead.
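To make the idea concrete, the following is a minimal sketch of one way a distance-based score could be mapped to a probability by reusing distances that were already computed. It is an illustration under stated assumptions, not the method of the paper: the abstract does not specify how the distance distributions are modeled, so the sketch simply pools all pairwise distances and uses their empirical CDF. The function name `knn_outlier_probabilities`, the choice of the k-th nearest-neighbor distance as the score, and the toy data are all hypothetical.

```python
import numpy as np


def knn_outlier_probabilities(X, k=5):
    """Return (scores, probs) for each row of X.

    scores: distance to the k-th nearest neighbor (a common distance-based score).
    probs:  empirical CDF of all pairwise distances, evaluated at each score.
            This CDF choice is an assumption made for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")

    # Full pairwise Euclidean distance matrix (fine for small n).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # After sorting each row, column 0 is the self-distance (0),
    # so column k is the distance to the k-th nearest neighbor.
    scores = np.sort(D, axis=1)[:, k]

    # Reuse the distances that would otherwise be discarded:
    # pool all off-diagonal entries and sort them once.
    pooled = np.sort(D[~np.eye(n, dtype=bool)])

    # Empirical CDF: fraction of pooled distances <= score.
    # The CDF is non-decreasing, so the score ranking is preserved up to ties.
    probs = np.searchsorted(pooled, scores, side="right") / pooled.size
    return scores, probs


# Toy data: 50 points from a standard normal cluster plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])
scores, probs = knn_outlier_probabilities(X, k=5)
```

Because the mapping is a non-decreasing function of the score, any ranking-based evaluation of the original scores carries over to the transformed values, which is the sense in which such a transformation can be ranking-stable. Whether an empirical CDF also yields the contrast increase reported in the abstract is not something this sketch establishes.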