Anomaly detection plays an increasingly important role in critical tasks across many fields, such as intrusion detection in cybersecurity, financial risk detection, and human health monitoring. Among the many anomaly detection methods that have been proposed, those based on the isolation forest mechanism stand out for their simplicity, effectiveness, and efficiency; iForest, for example, is often deployed in practice as a state-of-the-art detector. While most isolation forests use a binary tree structure, the LSHiForest framework has demonstrated that a multi-fork isolation tree structure can lead to better detection performance. However, no theoretical work has yet answered a question of both fundamental and practical importance: what is the optimal tree structure for an isolation forest with respect to the branching factor? In this paper, we establish a theory of isolation efficiency to answer this question and determine the optimal branching factor for an isolation tree. Building on this theoretical underpinning, we design a practical optimal isolation forest, OptIForest, which incorporates clustering-based learning to hash so that more information can be learned from the data for better isolation quality. The rationale of our approach is that OptIForest achieves a better bias-variance trade-off through bias reduction. Extensive comparative and ablation experiments on a series of benchmark datasets demonstrate that our approach generally achieves better detection performance, efficiently and robustly, than state-of-the-art methods, including deep learning based ones.
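To make the notion of a branching factor concrete, the following is a minimal sketch of a generic multi-fork isolation forest. It is not OptIForest and not LSHiForest: it uses plain uniform random cuts on a random feature rather than clustering-based learning to hash, and it omits the usual leaf-size correction to the path length. All names (`build_tree`, `path_length`, `build_forest`, `mean_path_length`) and default parameter values are hypothetical, chosen only for illustration.

```python
import random
from bisect import bisect_right


def build_tree(points, branching, rng, depth=0, max_depth=8):
    """Recursively partition points into up to `branching` children per node."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}  # leaf: isolated point, empty bucket, or depth cap
    dim = rng.randrange(len(points[0]))  # random feature, as in basic isolation trees
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split identical values
    # branching - 1 random cut points yield `branching` intervals (children)
    cuts = sorted(rng.uniform(lo, hi) for _ in range(branching - 1))
    buckets = [[] for _ in range(branching)]
    for p in points:
        buckets[bisect_right(cuts, p[dim])].append(p)
    children = [build_tree(b, branching, rng, depth + 1, max_depth) for b in buckets]
    return {"dim": dim, "cuts": cuts, "children": children}


def path_length(tree, point):
    """Number of edges traversed before the point reaches a leaf."""
    depth = 0
    while "children" in tree:
        tree = tree["children"][bisect_right(tree["cuts"], point[tree["dim"]])]
        depth += 1
    return depth


def build_forest(points, branching=3, n_trees=50, sample_size=64, seed=0):
    """Build `n_trees` trees, each on a random subsample of the data."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    return [build_tree(rng.sample(points, size), branching, rng) for _ in range(n_trees)]


def mean_path_length(forest, point):
    """Average path length over the forest; shorter means easier to isolate."""
    return sum(path_length(t, point) for t in forest) / len(forest)
```

With `branching=2` this reduces to a binary isolation tree with random axis-parallel cuts. Points with a shorter mean path length are isolated more easily and are therefore treated as more anomalous; varying `branching` changes how many children each node has and how quickly that isolation happens.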