Anomaly detection (AD) is essential for identifying rare and often critical events in complex systems, with applications in network intrusion detection, financial fraud detection, and fault detection in infrastructure and industrial systems. AD is typically treated as an unsupervised learning task because label annotation is costly, but in practice it is often more realistic to assume access to a small set of anomaly samples labeled by domain experts, which is the setting of semi-supervised anomaly detection. Semi-supervised and supervised approaches can leverage such labeled data to improve performance. In this paper, rather than proposing a new semi-supervised or supervised approach for AD, we introduce a novel algorithm that generates additional pseudo-anomalies from the limited labeled anomalies and a large volume of unlabeled data. The generated samples serve as an augmentation that facilitates the detection of new anomalies. Our proposed algorithm, named Nearest Neighbor Gaussian Mixup (NNG-Mix), efficiently integrates information from both labeled and unlabeled data to generate pseudo-anomalies. We compare NNG-Mix with commonly applied augmentation techniques, such as Mixup and Cutout, and evaluate it by training various existing semi-supervised and supervised anomaly detection algorithms on the original training data together with the generated pseudo-anomalies. Through extensive experiments on 57 benchmark datasets in ADBench, covering different data types, we demonstrate that NNG-Mix outperforms other data augmentation methods and yields significant performance improvements over baselines trained exclusively on the original training data. Notably, NNG-Mix yields improvements of up to 16.4%, 8.8%, and 8.0% on the Classical, CV, and NLP datasets in ADBench, respectively. Our source code will be available at https://github.com/donghao51/NNG-Mix.
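The abstract names three ingredients (nearest neighbors, Gaussian noise, and Mixup) but does not spell out how they are combined. The following is a minimal sketch of one plausible reading, not the authors' implementation: for each labeled anomaly, pick a random partner among its nearest neighbors in the pooled labeled and unlabeled data, perturb both points with Gaussian noise, and interpolate them Mixup-style. The function name `nng_mix_sketch` and the parameters `k`, `alpha`, and `sigma`, along with their default values, are assumptions made for illustration; the repository linked above is the authoritative reference.

```python
import numpy as np


def nng_mix_sketch(x_anom, x_unlabeled, n_gen, k=5, alpha=0.2, sigma=0.01, seed=0):
    """Generate pseudo-anomalies (illustrative sketch, not the official code).

    x_anom:      (n_a, d) array of labeled anomalies
    x_unlabeled: (n_u, d) array of unlabeled samples
    n_gen:       number of pseudo-anomalies to generate
    k, alpha, sigma: hypothetical hyperparameters (neighbor count,
                     Beta parameter for the Mixup weight, noise std)
    """
    rng = np.random.default_rng(seed)
    # Assumption: neighbors are searched in the union of both sets.
    pool = np.vstack([x_anom, x_unlabeled])
    k = min(k, len(pool))
    out = np.empty((n_gen, x_anom.shape[1]))

    for i in range(n_gen):
        # Pick a labeled anomaly as the anchor point.
        a = x_anom[rng.integers(len(x_anom))]

        # Brute-force Euclidean k-nearest-neighbor search. The anchor itself
        # is in the pool, so it can be selected as its own partner.
        dist = np.linalg.norm(pool - a, axis=1)
        nn_idx = np.argsort(dist)[:k]
        b = pool[rng.choice(nn_idx)]

        # Perturb both points with isotropic Gaussian noise.
        a_noisy = a + rng.normal(0.0, sigma, size=a.shape)
        b_noisy = b + rng.normal(0.0, sigma, size=b.shape)

        # Mixup: convex combination with a Beta-distributed weight.
        lam = rng.beta(alpha, alpha)
        out[i] = lam * a_noisy + (1.0 - lam) * b_noisy

    return out
```

In the evaluation setup the abstract describes, the returned array would be appended to the training set with the anomaly label before fitting a semi-supervised or supervised detector. Restricting the partner to nearby points keeps the interpolated samples close to known anomalies, which is the intuition the method's name suggests; whether this matches the published algorithm in every detail is not established by the abstract alone.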