Unsupervised Outlier Detection (UOD) is an important data mining task. With the advance of deep learning, deep Outlier Detection (OD) has received broad interest. Most deep UOD models are trained exclusively on clean datasets to learn the distribution of normal data, which requires substantial manual effort to clean real-world data, if cleaning is possible at all. Other approaches train and detect directly on unlabeled contaminated datasets, which calls for methods that are robust to contamination. Ensemble methods have emerged as a superior solution for improving model robustness against contaminated training sets, but they greatly increase training time. In this study, we investigate the impact of outliers on the training phase, aiming to halt training on unlabeled contaminated datasets before performance degrades. We first observe that mixing normal and anomalous data causes fluctuations in AUC, a label-dependent measure of detection accuracy. To circumvent the need for labels, we propose a zero-label metric named Loss Entropy, computed on the loss distribution, which enables us to infer the optimal stopping point for training without labels. We also theoretically demonstrate a negative correlation between this entropy metric and the label-based AUC. Based on this, we develop an automated early-stopping algorithm, EntropyStop, which halts training when the loss entropy suggests that the model's detection capability is at its maximum. We conduct extensive experiments on ADBench (including 47 real datasets), and the overall results indicate that an AutoEncoder (AE) enhanced by our approach not only achieves better performance than ensemble AEs but also requires under 1% of their training time. Lastly, we evaluate the proposed metric and early-stopping approach on other deep OD models, showing their potential for broad applicability.
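To make the idea of a label-free stopping signal concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the entropy is the Shannon entropy of the per-sample losses after normalizing them to sum to 1, and it uses a simple patience rule to pick the entropy minimum. The function names `loss_entropy` and `stop_index` and the `patience` parameter are hypothetical and are not taken from the abstract.

```python
import math
from typing import Sequence


def loss_entropy(losses: Sequence[float]) -> float:
    """Shannon entropy of per-sample losses normalized to sum to 1.

    Assumed definition for illustration; the abstract does not give the formula.
    """
    if any(loss < 0 for loss in losses):
        raise ValueError("losses must be non-negative")
    total = sum(losses)
    if total <= 0:
        raise ValueError("losses must have a positive sum")
    probs = [loss / total for loss in losses]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def stop_index(entropies: Sequence[float], patience: int = 3) -> int:
    """Return the index of the lowest entropy seen before `patience`
    consecutive evaluations pass without a new minimum (hypothetical rule)."""
    best_idx = 0
    for i in range(1, len(entropies)):
        if entropies[i] < entropies[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx


# Uniform losses give the maximum entropy log(n); losses concentrated on a
# few samples give a lower value.
uniform = loss_entropy([1.0, 1.0, 1.0, 1.0])
concentrated = loss_entropy([1.0, 1.0, 1.0, 10.0])
```

Under these assumptions, a model that reconstructs most samples well and a few samples poorly produces a concentrated loss distribution and therefore a lower entropy. This is consistent with the negative correlation between the entropy metric and AUC stated above. In practice the losses would be evaluated on a batch of the unlabeled training data at regular intervals, and the model checkpoint at `stop_index` would be kept.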