In many real-world anomaly detection (AD) applications, including computer security and fraud prevention, the anomaly detector must be configurable by the human analyst to minimize the effort spent on false positives. One important way to configure the detector is by providing true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the weights of ensemble detectors based on label feedback allows us to quickly discover true anomalies. This paper makes four main contributions to improve the state of the art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical successes of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also present empirical results on real-world data to support our insights and theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm that improves the diversity of discovered anomalies, based on a formalism called compact description for describing the discovered anomalies. Third, we develop a novel active learning algorithm to handle the streaming data setting. We present a data drift detection algorithm that not only detects the drift robustly, but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning allows us to discover significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms under the streaming data setting are competitive with the batch setting.
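The greedy query loop mentioned in the abstract (query the top-scoring instance, then adjust the ensemble weights from the analyst's label) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `greedy_active_discovery`, the `oracle` callback, the learning rate `lr`, and the perceptron-style weight update are all assumptions made for illustration, standing in for whatever weight-tuning procedure the authors actually use. The per-member scores are assumed to be precomputed, with higher values meaning more anomalous.

```python
import numpy as np


def greedy_active_discovery(scores, oracle, budget, lr=1.0):
    """Greedy query loop over a weighted ensemble (illustrative only).

    scores: (n, m) array; scores[i, j] is the anomaly score that ensemble
            member j assigns to instance i (higher = more anomalous).
    oracle: callable taking an instance index and returning True if the
            analyst labels that instance as an anomaly.
    budget: maximum number of label queries.
    Returns the list of queried indices and the final weight vector.
    """
    scores = np.asarray(scores, dtype=float)
    n, m = scores.shape
    w = np.full(m, 1.0 / np.sqrt(m))      # uniform, unit-norm starting weights
    unlabeled = np.ones(n, dtype=bool)
    queried = []

    for _ in range(min(budget, n)):
        combined = scores @ w              # weighted ensemble score
        combined[~unlabeled] = -np.inf     # never re-query a labeled instance
        i = int(np.argmax(combined))       # greedy: take the top-scoring instance
        is_anomaly = oracle(i)
        queried.append(i)
        unlabeled[i] = False

        # Assumed perceptron-style update: move the weights toward members
        # that scored a confirmed anomaly highly, and away from members
        # that scored a false positive highly.
        direction = 1.0 if is_anomaly else -1.0
        w = w + direction * lr * scores[i]
        norm = np.linalg.norm(w)
        if norm > 0:
            w = w / norm

    return queried, w


# Toy data (hypothetical): member 0 is informative, member 1 is misleading.
demo_scores = np.array([
    [0.90, 0.10],   # anomaly
    [0.80, 0.20],   # anomaly
    [0.10, 0.95],   # nominal
    [0.20, 0.90],   # nominal
    [0.10, 0.10],   # nominal
])
demo_labels = [True, True, False, False, False]

demo_queried, demo_weights = greedy_active_discovery(
    demo_scores, lambda i: demo_labels[i], budget=3
)
print("queried:", demo_queried)
print("weights:", demo_weights)
```

In this toy run the first query is a false positive favored by the misleading member; after that single label the weights shift toward the informative member, and the remaining queries land on the two true anomalies.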