Effectiveness of Tree-based Ensembles for Anomaly Discovery: Insights, Batch and Streaming Active Learning

Source: arXiv. Accepted for publication in the Journal of Artificial Intelligence Research. 46 pages; code is available at https://github.com/shubhomoydas/ad_examples. arXiv admin note: substantial text overlap with arXiv:1809.06477

In many real-world anomaly detection (AD) applications, including computer security and fraud prevention, the anomaly detector must be configurable by the human analyst to minimize the effort spent on false positives. One important way to configure the detector is to provide true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the weights of ensemble detectors based on label feedback allows us to quickly discover true anomalies. This paper makes four main contributions to improve the state of the art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical success of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also present empirical results on real-world data to support our insights and theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm that improves the diversity of discovered anomalies, based on a formalism called compact description for describing the discovered anomalies. Third, we develop a novel active learning algorithm for the streaming data setting. We present a data drift detection algorithm that not only detects drift robustly but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning allows us to discover significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms in the streaming data setting are competitive with the batch setting.
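The greedy query-and-reweight loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `greedy_active_discovery`, the `oracle` callback standing in for the analyst, the learning rate `lr`, and the additive weight update are all assumptions chosen for brevity. It only shows the general idea of querying the top-scoring unlabeled instance and shifting ensemble weights according to the label received.

```python
def greedy_active_discovery(scores, oracle, budget, lr=0.5):
    """Hypothetical sketch of greedy active anomaly discovery.

    scores[i][m] is the score of instance i under ensemble member m
    (higher means more anomalous). oracle(i) returns True when the
    analyst labels instance i as a true anomaly.
    """
    n_members = len(scores[0])
    # Start from the unsupervised ensemble: every member weighted equally.
    weights = [1.0 / n_members] * n_members
    unlabeled = set(range(len(scores)))
    queried, found = [], []

    for _ in range(min(budget, len(scores))):
        # Greedy query: the unlabeled instance with the highest weighted score.
        top = max(
            sorted(unlabeled),
            key=lambda i: sum(w * s for w, s in zip(weights, scores[i])),
        )
        unlabeled.remove(top)
        queried.append(top)

        is_anomaly = oracle(top)
        if is_anomaly:
            found.append(top)

        # Assumed update rule (not from the paper): reward members that
        # scored a true anomaly highly, penalize members that scored a
        # nominal instance highly, then clip at zero and renormalize.
        sign = 1.0 if is_anomaly else -1.0
        weights = [max(0.0, w + sign * lr * s)
                   for w, s in zip(weights, scores[top])]
        total = sum(weights)
        if total > 0:
            weights = [w / total for w in weights]
        else:
            weights = [1.0 / n_members] * n_members

    return queried, found, weights
```

With a toy score matrix in which one ensemble member is informative and the other is misleading, a single nominal label is enough for this sketch to down-weight the misleading member, so later queries concentrate on the true anomalies. The real system learns weights over tree-based ensemble components with a more principled objective; see the linked repository for the authors' implementation.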


