In many real-world anomaly detection (AD) applications, including computer security and fraud prevention, the anomaly detector must be configurable by the human analyst to minimize the effort spent on false positives. One important way to configure the detector is by providing true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the weights of ensemble detectors based on label feedback allows us to quickly discover true anomalies. This paper makes four main contributions to improve the state of the art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical successes of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also present empirical results on real-world data to support this insight and theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm that improves the diversity of discovered anomalies, based on a formalism called compact description for describing the discovered anomalies. Third, we develop a novel active learning algorithm to handle the streaming data setting. We present a data drift detection algorithm that not only detects drift robustly, but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning allows us to discover significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms in the streaming data setting are competitive with the batch setting.
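The greedy query loop summarized in the abstract (query the top-scoring instance, obtain its label, retune the ensemble weights) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm developed in the paper: the function name `greedy_active_discovery`, the learning rate `lr`, and the perceptron-style weight update are all hypothetical stand-ins, and the per-detector scores are assumed to be precomputed with higher values meaning more anomalous.

```python
import numpy as np


def greedy_active_discovery(scores, labels, budget, lr=0.1):
    """Simulate greedy active anomaly discovery over a weighted ensemble.

    scores: (n, m) array of per-detector anomaly scores (higher = more anomalous).
    labels: (n,) array of true labels (1 = anomaly, 0 = nominal); consulted only
            for the queried instance, standing in for the human analyst.
    budget: maximum number of label queries.
    lr:     step size of the illustrative weight update (an assumption).
    Returns the queried indices in order and the final weight vector.
    """
    n, m = scores.shape
    w = np.full(m, 1.0 / np.sqrt(m))       # start from uniform, unit-norm weights
    unlabeled = np.ones(n, dtype=bool)
    queried = []

    for _ in range(min(budget, n)):
        combined = scores @ w               # ensemble score of every instance
        combined[~unlabeled] = -np.inf      # never re-query a labeled instance
        i = int(np.argmax(combined))        # greedy choice: top-scoring instance
        queried.append(i)
        unlabeled[i] = False

        # Illustrative feedback step (not the paper's update): move the weights
        # toward detectors that scored a true anomaly highly, and away from
        # detectors that scored a nominal instance highly.
        direction = 1.0 if labels[i] == 1 else -1.0
        w = w + lr * direction * scores[i]
        norm = np.linalg.norm(w)
        if norm > 0:
            w = w / norm                    # keep the weights on the unit sphere

    return queried, w
```

In use, `scores` would hold one column per ensemble member (for a tree-based ensemble, for example, one column per tree or per leaf region), and `labels` would be replaced by actual analyst responses rather than a ground-truth array.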