Novelty Detection in Sequential Data by Informed Clustering and Modeling

Novelty detection in discrete sequences is a challenging task, since deviations from the process generating the normal data are often small or intentionally hidden. Novelties can be detected by modeling normal sequences and measuring how far a new sequence deviates from the model's predictions. However, in many applications the data is generated by several distinct processes, so that models trained on all the data tend to over-generalize and novelties remain undetected. We propose to approach this challenge through decomposition: by clustering the data we break down the problem, obtaining a simpler modeling task in each cluster, which can be modeled more accurately. This comes with a trade-off, however, since the amount of training data per cluster is reduced. This is a particular problem for discrete sequences, where state-of-the-art models are data-hungry. The success of this approach thus depends on the quality of the clustering, i.e., on whether the individual learning problems are sufficiently simpler than the joint problem. While clustering discrete sequences automatically is a challenging and domain-specific task, it is often easy for human domain experts, given the right tools. In this paper, we adapt a state-of-the-art visual analytics tool for discrete sequence clustering to obtain informed clusters from domain experts, and we use LSTMs to model each cluster individually. Our extensive empirical evaluation indicates that this informed clustering outperforms automatic clustering and that our approach outperforms state-of-the-art novelty detection methods for discrete sequences in three real-world application scenarios. In particular, decomposition outperforms a global model despite the smaller amount of training data in each individual cluster.
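
The decomposition idea can be illustrated with a small Python sketch. It is a toy illustration under stated assumptions, not the implementation used in the paper: a smoothed first-order Markov model stands in for the LSTM, hand-assigned cluster labels stand in for the expert-informed clustering, the cluster of a new sequence is assumed to be known, and the data and all names (`MarkovModel`, `fit_per_cluster`) are hypothetical.

```python
import math
from collections import defaultdict


class MarkovModel:
    """First-order Markov model with add-one smoothing.

    Assumption: this is only a lightweight stand-in for the LSTM
    sequence model mentioned in the abstract.
    """

    def __init__(self, alphabet):
        self.alphabet = sorted(set(alphabet))
        # counts[prev][nxt] = number of observed transitions prev -> nxt
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1
        return self

    def prob(self, prev, nxt):
        row = self.counts.get(prev, {})
        total = sum(row.values())
        return (row.get(nxt, 0) + 1) / (total + len(self.alphabet))

    def novelty_score(self, seq):
        """Mean negative log-likelihood of the transitions in seq."""
        pairs = list(zip(seq, seq[1:]))
        if not pairs:
            return 0.0
        return -sum(math.log(self.prob(p, n)) for p, n in pairs) / len(pairs)


def fit_per_cluster(sequences, labels, alphabet):
    """Train one model per cluster (labels are assumed to be given)."""
    clusters = defaultdict(list)
    for seq, label in zip(sequences, labels):
        clusters[label].append(seq)
    return {
        label: MarkovModel(alphabet).fit(seqs)
        for label, seqs in clusters.items()
    }


# Toy "normal" data from two distinct processes (hypothetical).
alphabet = "ab"
normal = ["abababab", "babababa", "aaaaaaaa", "bbbbbbbb"]
labels = [0, 0, 1, 1]  # cluster 0: alternating, cluster 1: repeating

global_model = MarkovModel(alphabet).fit(normal)
cluster_models = fit_per_cluster(normal, labels, alphabet)

typical = "abababab"  # follows the process behind cluster 0
deviant = "aabbaabb"  # deviates from the process behind cluster 0

# Scores under the single global model.
global_typical = global_model.novelty_score(typical)
global_deviant = global_model.novelty_score(deviant)

# Scores under the model of cluster 0 (cluster assumed known).
cluster_typical = cluster_models[0].novelty_score(typical)
cluster_deviant = cluster_models[0].novelty_score(deviant)

print("global :", global_typical, global_deviant)
print("cluster:", cluster_typical, cluster_deviant)
```

In this toy setup the global model has seen both alternating and repeating transitions, so it assigns the same score to both sequences. The model trained only on cluster 0 scores the deviating sequence clearly higher than the typical one, which is the effect the decomposition aims for.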
