Effectiveness of Tree-based Ensembles for Anomaly Discovery: Insights, Batch and Streaming Active Learning

From arXiv. Accepted for publication in the Journal of Artificial Intelligence Research. 46 pages; code is available at https://github.com/shubhomoydas/ad_examples. arXiv admin note: substantial text overlap with arXiv:1809.06477.

In many real-world anomaly detection (AD) applications, including computer security and fraud prevention, the anomaly detector must be configurable by the human analyst to minimize the effort spent on false positives. One important way to configure the detector is by providing true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the weights of ensemble detectors based on label feedback allows us to quickly discover true anomalies. This paper makes four main contributions to improve the state of the art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical successes of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also present empirical results on real-world data to support our insights and theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm that improves the diversity of discovered anomalies, based on a formalism called compact description for describing the discovered anomalies. Third, we develop a novel active learning algorithm to handle the streaming data setting. We present a data drift detection algorithm that not only detects drift robustly, but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning allows us to discover significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms under the streaming-data setup are competitive with the batch setup.
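The greedy query loop mentioned in the abstract (query the top-scoring instance, then retune the ensemble weights from the label) can be illustrated with a small sketch. This is not the paper's algorithm: the per-feature z-score "detectors", the perceptron-style weight update, the learning rate `lr`, and the names `component_scores` and `greedy_active_discovery` are all assumptions made for illustration, and a tree-based ensemble would supply the score matrix in practice.

```python
import numpy as np


def component_scores(X):
    """Toy ensemble: one crude detector per feature (absolute z-score).

    Assumption for illustration only; a tree-based ensemble would
    normally produce these per-member scores.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12
    return np.abs((X - mu) / sigma)


def greedy_active_discovery(S, oracle, budget, lr=0.5):
    """Greedily query the top-scoring unlabeled instance and reweight.

    S      : (n, m) array, score of each instance under each ensemble member
    oracle : callable, index -> True if the analyst labels it an anomaly
    budget : number of label queries (capped at n)
    """
    n, m = S.shape
    w = np.full(m, 1.0 / m)            # uniform weights = unsupervised ensemble
    labeled = np.zeros(n, dtype=bool)
    queried, found = [], 0
    for _ in range(min(budget, n)):
        scores = S @ w
        scores[labeled] = -np.inf      # never re-query a labeled instance
        i = int(np.argmax(scores))     # greedy: take the top-scoring instance
        is_anomaly = bool(oracle(i))
        labeled[i] = True
        queried.append(i)
        found += int(is_anomaly)
        # Hypothetical perceptron-style update (not the paper's optimizer):
        # move toward members that scored a true anomaly highly, away from
        # members that scored a false positive highly.
        direction = S[i] / (np.linalg.norm(S[i]) + 1e-12)
        w = w + lr * direction if is_anomaly else w - lr * direction
        w = np.clip(w, 0.0, None)      # keep weights non-negative
        total = w.sum()
        w = w / total if total > 0 else np.full(m, 1.0 / m)
    return queried, found, w


# Synthetic demo: 10 anomalies that deviate only in feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:10, 0] += 6.0
y = np.zeros(200, dtype=bool)
y[:10] = True
queried, found, w = greedy_active_discovery(
    component_scores(X), lambda i: y[i], budget=20
)
print(f"anomalies found in {len(queried)} queries: {found}; weights: {w.round(2)}")
```

The weights are renormalized after every label so that scores stay comparable across iterations; whether feedback actually concentrates weight on the informative member depends on the data and on `lr`, so the sketch prints the result rather than promising one.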

