Unsupervised Outlier Detection (UOD) is a critical task in data mining and machine learning that aims to identify instances that deviate significantly from the majority. Without labels, deep UOD methods suffer from a misalignment between the model's direct optimization objective and the final performance goal of the Outlier Detection (OD) task. From the perspective of training dynamics, this paper proposes an early stopping algorithm that optimizes the training of deep UOD models so that they perform optimally on OD rather than overfitting the entire contaminated dataset. Inspired by the UOD mechanism and the inlier priority phenomenon, in which models intuitively fit inliers more quickly than outliers, we propose GradStop, a sampling-based, label-free algorithm that estimates the model's real-time performance during training. First, a sampling method generates two sets, one likely to contain more outliers and the other more inliers. Then, a metric based on gradient cohesion is applied to probe the current training dynamics, reflecting the model's performance on the OD task. Experimental results on 4 deep UOD algorithms and 47 real-world datasets, together with theoretical proofs, demonstrate the effectiveness of the proposed early stopping algorithm in improving the performance of deep UOD models. An AutoEncoder (AE) enhanced by GradStop outperforms the plain AE, other SOTA UOD methods, and even ensembles of AEs. Our method provides a robust and effective solution to performance degradation during training, enabling deep UOD models to better realize their potential in anomaly detection tasks.
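The two-step procedure described in the abstract (sample a likely-inlier set and a likely-outlier set, then score gradient cohesion to decide when to stop) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the model is a tied linear autoencoder, the two sets are chosen by ranking current reconstruction error, cohesion is measured as mean pairwise cosine similarity of per-sample gradients, and the stopping rule is a simple patience check on the resulting gap. The function names (`per_sample_gradients`, `cohesion`, `cohesion_gap`, `train_with_proxy_stopping`) are hypothetical and do not reproduce the GradStop metric itself.

```python
import numpy as np


def per_sample_gradients(W, X):
    """Per-sample gradients of ||x W W^T - x||^2 w.r.t. W (tied linear AE, assumed)."""
    H = X @ W                      # codes, shape (n, k)
    R = H @ W.T - X                # reconstruction residuals, shape (n, d)
    G = 2.0 * (np.einsum("id,ik->idk", X, R @ W)
               + np.einsum("id,ik->idk", R, H))
    # Flattened gradients (n, d*k) and per-sample reconstruction errors (n,)
    return G.reshape(len(X), -1), (R ** 2).sum(axis=1)


def _unit_rows(M, eps=1e-12):
    """Scale each row to (approximately) unit Euclidean norm."""
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)


def cohesion(A, B=None):
    """Mean cosine similarity between rows of A and B (within A if B is None)."""
    if B is None:
        S = _unit_rows(A) @ _unit_rows(A).T
        n = len(A)
        return (S.sum() - np.trace(S)) / (n * (n - 1))  # exclude self-pairs
    return (_unit_rows(A) @ _unit_rows(B).T).mean()


def cohesion_gap(W, X, frac=0.1):
    """Assumed proxy: cohesion inside the low-error set minus cohesion across sets."""
    G, err = per_sample_gradients(W, X)
    order = np.argsort(err)
    m = max(2, int(frac * len(X)))
    g_in, g_out = G[order[:m]], G[order[-m:]]  # likely inliers / likely outliers
    return cohesion(g_in) - cohesion(g_in, g_out)


def train_with_proxy_stopping(X, k=2, epochs=200, lr=1e-2, patience=20, seed=0):
    """Full-batch gradient descent; keep the weights with the best proxy score."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], k))
    best_score, best_W, best_epoch = -np.inf, W.copy(), 0
    for epoch in range(epochs):
        G, _ = per_sample_gradients(W, X)
        W = W - lr * G.mean(axis=0).reshape(W.shape)
        score = cohesion_gap(W, X)
        if score > best_score:
            best_score, best_W, best_epoch = score, W.copy(), epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs (assumed rule)
    return best_W, best_epoch, best_score
```

Calling `train_with_proxy_stopping(X)` on a standardized data matrix `X` returns the weights from the epoch with the highest proxy score, so no labels are needed at any point. The per-sample errors from `per_sample_gradients(best_W, X)` can then serve as outlier scores.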