Refining Decision Boundaries in Anomaly Detection Using Similarity Search Within the Feature Space

Detecting rare and diverse anomalies in highly imbalanced datasets, such as Advanced Persistent Threats (APTs) in cybersecurity, remains a fundamental challenge for machine learning systems. Active learning offers a promising direction by strategically querying an oracle to minimize labeling effort, yet conventional approaches often fail to exploit the intrinsic geometric structure of the feature space when refining the model. In this paper, we introduce SDA2E, a Sparse Dual Adversarial Attention-based AutoEncoder designed to learn compact and discriminative latent representations from imbalanced, high-dimensional data. We further propose a similarity-guided active learning framework that integrates three novel strategies to refine decision boundaries efficiently: normal-like expansion, which enriches the training set with points similar to labeled normals to improve reconstruction fidelity; anomaly-like prioritization, which improves ranking accuracy by focusing on points that resemble known anomalies; and a hybrid strategy that combines both to balance model refinement and ranking. A key component of our framework is a new similarity measure, Normalized Matching 1s (SIM_NM1), tailored for sparse binary embeddings. We evaluate SDA2E extensively across 52 imbalanced datasets, including multiple DARPA Transparent Computing scenarios, and benchmark it against 15 state-of-the-art anomaly detection methods. Results demonstrate that SDA2E consistently achieves superior ranking performance (nDCG up to 1.0 in several cases) while reducing the required labeled data by up to 80% compared to passive training. Statistical tests confirm the significance of these improvements. Our work establishes a robust, efficient, and statistically validated framework for anomaly detection that is particularly suited to cybersecurity applications such as APT detection.

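To make anomaly-like prioritization and the nDCG metric concrete, here is a minimal sketch. The abstract does not define SIM_NM1, so the normalization used here (shared 1s divided by the larger count of 1s in either vector) is purely an assumption for illustration, and the names `sim_nm1`, `anomaly_like_scores`, and `ndcg` are hypothetical. Only the nDCG computation follows a standard definition (gain divided by a log2 position discount, normalized by the ideal ordering).

```python
import math
from typing import List, Sequence


def sim_nm1(a: Sequence[int], b: Sequence[int]) -> float:
    """ASSUMED form of a 'normalized matching 1s' similarity for 0/1 vectors:
    positions where both vectors are 1, divided by the larger number of 1s.
    The paper's actual normalization may differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    denom = max(sum(a), sum(b))
    return matches / denom if denom else 0.0


def anomaly_like_scores(pool: List[Sequence[int]],
                        known_anomalies: List[Sequence[int]]) -> List[float]:
    """Score each unlabeled point by its best similarity to any known anomaly."""
    return [max(sim_nm1(p, k) for k in known_anomalies) for p in pool]


def ndcg(relevance_in_ranked_order: Sequence[float]) -> float:
    """Standard nDCG: DCG of the given ranking divided by DCG of the ideal one."""
    def dcg(rels: Sequence[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevance_in_ranked_order, reverse=True))
    return dcg(relevance_in_ranked_order) / ideal if ideal else 0.0


# Toy sparse binary embeddings (made-up data, not from the paper).
pool = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
known_anomalies = [[1, 1, 0, 0, 0, 1]]
true_labels = [1, 0, 1, 0]  # 1 = anomaly; used only to evaluate the ranking

scores = anomaly_like_scores(pool, known_anomalies)
# Rank pool indices from most to least anomaly-like (stable for ties).
order = sorted(range(len(pool)), key=lambda i: -scores[i])
print(order, ndcg([true_labels[i] for i in order]))
```

In this toy setup, points that share active bits with the known anomaly rise to the top of the query list, which is the intuition behind prioritizing anomaly-like points. A real pipeline would draw the binary vectors from the autoencoder's latent layer rather than hard-coding them.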