Unsupervised machine learning methods are well suited to searching for anomalies at scale, but they can struggle with the high-dimensional representation of many modern datasets, so dimensionality reduction (DR) is often performed first. In this paper we analyse unsupervised anomaly detection (AD) from the perspective of the manifold created in DR. We present an idealised illustration, "Finding Pegasus", and a novel formal framework with which we categorise AD methods and their results as "on manifold" or "off manifold". We define these terms and show how they differ. We then use this insight to develop an approach to combining AD methods that significantly boosts AD recall without sacrificing precision in situations employing high DR. When tested on MNIST data, our approach improves recall by as much as 16 percent compared with simply combining with the best standalone AD method (Isolation Forest), a result that shows great promise for application to real-world data.
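The distinction between "on manifold" and "off manifold" anomalies can be illustrated with a small sketch. It is not the method of the paper: the abstract does not specify which detectors are used on each side or how they are combined, so everything below is an assumption made for illustration. PCA stands in for the DR step, reconstruction error serves as a hypothetical off-manifold score, Mahalanobis distance in the reduced space serves as a hypothetical on-manifold score (in place of Isolation Forest), and the two detectors are combined with a simple union rule. The data are synthetic, and all function and variable names are invented for this example.

```python
import numpy as np


def fit_linear_manifold(X, n_components):
    """Fit a linear manifold to the rows of X with PCA (computed via SVD)."""
    mean = X.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]  # orthonormal rows spanning the manifold
    # Variance of the training data along each retained direction.
    variances = singular_values[:n_components] ** 2 / (len(X) - 1)
    return mean, components, variances


def manifold_scores(X, mean, components, variances):
    """Return (on_manifold, off_manifold) anomaly scores for each row of X."""
    Z = (X - mean) @ components.T            # coordinates on the manifold
    X_hat = Z @ components + mean            # projection back to input space
    off = np.linalg.norm(X - X_hat, axis=1)  # distance away from the manifold
    on = np.sqrt((Z ** 2 / variances).sum(axis=1))  # Mahalanobis distance within it
    return on, off


rng = np.random.default_rng(0)

# Synthetic "normal" data (an assumption): 500 points lying close to a
# 2-D plane embedded in 10 dimensions, with small isotropic noise.
latent = rng.normal(size=(500, 2))
basis = np.eye(10)[:2]
X_train = latent @ basis + 0.05 * rng.normal(size=(500, 10))

model = fit_linear_manifold(X_train, n_components=2)
train_on, train_off = manifold_scores(X_train, *model)

# Thresholds taken as the 99th percentile of the training scores (an assumption).
on_threshold = np.quantile(train_on, 0.99)
off_threshold = np.quantile(train_off, 0.99)

# Two hand-made anomalies.
X_test = np.zeros((2, 10))
X_test[0, 0] = 6.0  # lies in the plane but far from the bulk: "on manifold"
X_test[1, 5] = 1.0  # near the centre but displaced out of the plane: "off manifold"

test_on, test_off = manifold_scores(X_test, *model)
on_flags = test_on > on_threshold
off_flags = test_off > off_threshold
combined_flags = on_flags | off_flags  # union rule (an assumption)

print("on-manifold detector: ", on_flags)
print("off-manifold detector:", off_flags)
print("combined:             ", combined_flags)
```

In this toy setting each detector is blind to the other kind of anomaly, which is why taking the union can raise recall. Whether precision is preserved depends on how the thresholds are chosen, and the sketch does not address that.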