Machine learning algorithms have been applied to anomaly detection (AD) tasks in a wide range of domains and applications. The great majority of these algorithms use normal data to train a residual-based model, and assign anomaly scores to unseen samples based on their dissimilarity to the learned normal regime. The underlying assumption of these approaches is that anomaly-free data is available for training. This is, however, often not the case in real-world operational settings, where the training data may be contaminated with a certain fraction of abnormal samples. Training with contaminated data, in turn, inevitably degrades the AD performance of residual-based algorithms. In this paper, we introduce a framework for fully unsupervised refinement of contaminated training data for AD tasks. The framework is generic and can be applied to any residual-based machine learning model. We demonstrate the application of the framework to two public datasets of multivariate time series machine data from different application fields. We show its clear superiority over the naive approach of training with contaminated data without refinement. Moreover, we compare it to the ideal but unrealistic reference in which anomaly-free data is available for training. Since the approach exploits information from the anomalies, and not only from the normal regime, its performance is comparable to, and often better than, that of the ideal baseline.
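The abstract does not describe the refinement procedure itself, so the following is only a minimal sketch of the setting it refers to, not the paper's framework. It assumes a PCA reconstruction model as the residual-based model, synthetic data with 5% contamination, and a simple trimming heuristic (the hypothetical `refine_by_trimming`, with an assumed known contamination rate `assumed_contamination`) that repeatedly drops the highest-residual training samples and refits.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit a PCA reconstruction model; returns (mean, principal components)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def residual_scores(X, model):
    """Anomaly score per sample: squared reconstruction residual."""
    mean, comps = model
    centered = X - mean
    reconstruction = centered @ comps.T @ comps
    return np.sum((centered - reconstruction) ** 2, axis=1)


def refine_by_trimming(X, n_components, assumed_contamination, n_rounds=3):
    """Hypothetical heuristic (an assumption, not the paper's method):
    fit, score all samples, drop the top-scoring fraction, and refit."""
    keep = np.ones(len(X), dtype=bool)
    for _ in range(n_rounds):
        model = fit_pca(X[keep], n_components)
        scores = residual_scores(X, model)
        threshold = np.quantile(scores, 1.0 - assumed_contamination)
        keep = scores <= threshold
    return fit_pca(X[keep], n_components), keep


# Synthetic data (assumption): normal samples lie near a 2-D subspace of a
# 5-D space; anomalies are scattered in all five dimensions.
rng = np.random.default_rng(0)
basis = np.array([[1.0, 0.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0, 1.0]])
X_normal = rng.normal(size=(950, 2)) @ basis + 0.05 * rng.normal(size=(950, 5))
X_anomalous = 3.0 * rng.normal(size=(50, 5))
X = np.vstack([X_normal, X_anomalous])
is_anomaly = np.r_[np.zeros(950, dtype=bool), np.ones(50, dtype=bool)]

# Naive approach: train directly on the contaminated data.
naive_model = fit_pca(X, n_components=2)

# Refined approach: trim suspected anomalies without using any labels.
refined_model, keep = refine_by_trimming(
    X, n_components=2, assumed_contamination=0.05
)

print("anomalies left in training set:", int(keep[is_anomaly].sum()))
print("mean residual on normal data (naive):  ",
      residual_scores(X_normal, naive_model).mean())
print("mean residual on normal data (refined):",
      residual_scores(X_normal, refined_model).mean())
```

The labels in `is_anomaly` are used only to inspect the result, never for fitting or trimming, which mirrors the fully unsupervised setting. In practice the contamination rate is usually unknown, so treating it as a given input is the main simplification in this sketch.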