NNG-Mix: Improving Semi-supervised Anomaly Detection with Pseudo-anomaly Generation

Anomaly detection (AD) is essential in identifying rare and often critical events in complex systems, finding applications in fields such as network intrusion detection, financial fraud detection, and fault detection in infrastructure and industrial systems. While AD is typically treated as an unsupervised learning task due to the high cost of label annotation, it is more practical to assume access to a small set of labeled anomaly samples from domain experts, as is the case for semi-supervised anomaly detection. Semi-supervised and supervised approaches can leverage such labeled data, resulting in improved performance. In this paper, rather than proposing a new semi-supervised or supervised approach for AD, we introduce a novel algorithm for generating additional pseudo-anomalies on the basis of the limited labeled anomalies and a large volume of unlabeled data. This serves as an augmentation to facilitate the detection of new anomalies. Our proposed algorithm, named Nearest Neighbor Gaussian Mixup (NNG-Mix), efficiently integrates information from both labeled and unlabeled data to generate pseudo-anomalies. We compare the performance of this novel algorithm with commonly applied augmentation techniques, such as Mixup and Cutout. We evaluate NNG-Mix by training various existing semi-supervised and supervised anomaly detection algorithms on the original training data along with the generated pseudo-anomalies. Through extensive experiments on 57 benchmark datasets in ADBench, reflecting different data types, we demonstrate that NNG-Mix outperforms other data augmentation methods. It yields significant performance improvements compared to the baselines trained exclusively on the original training data. Notably, NNG-Mix yields up to 16.4%, 8.8%, and 8.0% improvements on Classical, CV, and NLP datasets in ADBench. Our source code will be available at https://github.com/donghao51/NNG-Mix.

翻译：异常检测（AD）是识别复杂系统中罕见且关键事件的核心技术，广泛应用于网络入侵检测、金融欺诈检测以及基础设施与工业系统故障检测等领域。由于标签标注成本高昂，异常检测通常被视为无监督学习任务，但更实际的场景是假设能够从领域专家处获得少量标注异常样本，这正是半监督异常检测的研究范畴。半监督与监督方法能够利用此类标注数据，从而提升检测性能。本文并未提出新的半监督或监督异常检测方法，而是创新性地设计了一种算法，基于有限标注异常样本与大量未标注数据生成额外伪异常样本，以此作为数据增强手段辅助新型异常的检测。我们提出的"近邻高斯混合"（NNG-Mix）算法，通过高效融合标注数据与未标注数据的信息生成伪异常样本。将该算法与Mixup、Cutout等常用数据增强技术进行性能对比，并通过在原始训练数据与生成的伪异常样本上训练多种现有半监督/监督异常检测算法进行评估。基于ADBench中57个涵盖不同数据类型的基准数据集的全面实验表明，NNG-Mix优于其他数据增强方法。与仅使用原始训练数据的基线模型相比，该方法实现了显著的性能提升：在ADBench的经典数据集、计算机视觉数据集与自然语言处理数据集上，分别取得高达16.4%、8.8%与8.0%的改进效果。源代码将在https://github.com/donghao51/NNG-Mix开源。

相关内容

异常检测

关注 102

在数据挖掘中，异常检测（英语：anomaly detection）对不符合预期模式或数据集中其他项目的项目、事件或观测值的识别。通常异常项目会转变成银行欺诈、结构缺陷、医疗问题、文本错误等类型的问题。异常也被称为离群值、新奇、噪声、偏差和例外。特别是在检测滥用与网络入侵时，有趣性对象往往不是罕见对象，但却是超出预料的突发活动。这种模式不遵循通常统计定义中把异常点看作是罕见对象，于是许多异常检测方法（特别是无监督的方法）将对此类数据失效，除非进行了合适的聚集。相反，聚类分析算法可能可以检测出这些模式形成的微聚类。有三大类异常检测方法。[1] 在假设数据集中大多数实例都是正常的前提下，无监督异常检测方法能通过寻找与其他数据最不匹配的实例来检测出未标记测试数据的异常。监督式异常检测方法需要一个已经被标记“正常”与“异常”的数据集，并涉及到训练分类器（与许多其他的统计分类问题的关键区别是异常检测的内在不均衡性）。半监督式异常检测方法根据一个给定的正常训练数据集创建一个表示正常行为的模型，然后检测由学习模型生成的测试实例的可能性。

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日