Machine learning algorithms have been applied to anomaly detection (AD) tasks in various domains and applications. The vast majority of these algorithms use normal data to train a residual-based model and assign anomaly scores to unseen samples based on their dissimilarity to the learned normal regime. The underlying assumption of these approaches is that anomaly-free data is available for training. This is, however, often not the case in real-world operational settings, where the training data may be contaminated with a certain fraction of abnormal samples. Training with contaminated data, in turn, inevitably degrades the AD performance of residual-based algorithms. In this paper, we introduce a framework for fully unsupervised refinement of contaminated training data for AD tasks. The framework is generic and can be applied to any residual-based machine learning model. We demonstrate the application of the framework on two public datasets of multivariate time series machine data from different application fields. We show its clear superiority over the naive approach of training with contaminated data without refinement. Moreover, we compare it to the ideal but unrealistic reference in which anomaly-free data is available for training. Because the approach exploits information from the anomalies, and not only from the normal regime, it is comparable to, and often outperforms, the ideal baseline as well.
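To make the notion of residual-based scoring concrete, the following is a minimal sketch and not the framework proposed in the paper. It assumes a PCA reconstruction model as a stand-in for "any residual-based machine learning model" and uses synthetic data; the function names `fit_pca` and `residual_scores`, the dimensions, and the noise levels are all illustrative assumptions. The model is fitted on anomaly-free data, and the anomaly score of an unseen sample is the norm of its reconstruction residual.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit a PCA model (hypothetical stand-in for a residual-based model)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def residual_scores(X, mean, components):
    """Anomaly score = Euclidean norm of the reconstruction residual."""
    centered = X - mean
    reconstructed = centered @ components.T @ components
    return np.linalg.norm(centered - reconstructed, axis=1)


rng = np.random.default_rng(0)

# Synthetic "normal regime": 5-dimensional samples close to a 2-dimensional subspace.
mixing = rng.normal(size=(2, 5))
X_train = rng.normal(size=(500, 2)) @ mixing + 0.05 * rng.normal(size=(500, 5))

# Train on anomaly-free data only (the assumption the abstract calls unrealistic).
mean, components = fit_pca(X_train, n_components=2)

# Unseen samples: normal ones follow the same structure, anomalous ones do not.
X_normal = rng.normal(size=(100, 2)) @ mixing + 0.05 * rng.normal(size=(100, 5))
X_anomalous = 3.0 * rng.normal(size=(100, 5))

scores_normal = residual_scores(X_normal, mean, components)
scores_anomalous = residual_scores(X_anomalous, mean, components)
```

In this toy setting the anomalous samples receive larger residuals on average than the normal ones. If abnormal samples were mixed into `X_train`, the fitted subspace would partly adapt to them, which is the deterioration that the refinement framework is meant to counteract.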