We consider unsupervised anomaly detection, the task of identifying anomalies in a dataset without labeled examples. Although distance-based methods are top-performing for unsupervised anomaly detection, they are highly sensitive to the choice of the number of nearest neighbors. In this paper, we propose a new distance-based algorithm, bagged regularized $k$-distances for anomaly detection (BRDAD), which converts the unsupervised anomaly detection problem into a convex optimization problem. BRDAD selects the weights by minimizing the surrogate risk, i.e., the finite-sample bound on the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE). This approach addresses the sensitivity of distance-based algorithms to the hyperparameter choice. Moreover, the bagging technique incorporated in BRDAD addresses the efficiency issues that arise on large-scale datasets. On the theoretical side, we establish fast convergence rates for the AUC regret of our algorithm and show that the bagging technique significantly reduces the computational complexity. On the practical side, we conduct numerical experiments on anomaly detection benchmarks that illustrate the insensitivity of our algorithm to parameter selection compared with other state-of-the-art distance-based methods. Moreover, applying the bagging technique in our algorithm brings promising improvements on real-world datasets.
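To make the idea of bagged weighted $k$-distances concrete, the following is a minimal Python sketch of an anomaly score built from them. It is not the BRDAD algorithm itself. The function name `bagged_weighted_k_distances`, its parameters, and the exclusion of a point's zero distance to itself are assumptions made for illustration. In particular, the uniform default weights are only a placeholder for the weights that BRDAD obtains by solving its convex surrogate-risk problem, whose exact form is not given in this abstract.

```python
import numpy as np


def bagged_weighted_k_distances(X, n_bags=10, subsample_size=None,
                                k_max=5, weights=None, seed=0):
    """Score each row of X by a weighted sum of bag-averaged k-distances.

    Larger scores suggest lower local density, i.e. more anomalous points.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    s = subsample_size if subsample_size is not None else max(k_max + 1, n // 2)
    if s > n or s <= k_max:
        raise ValueError("subsample_size must satisfy k_max < subsample_size <= n")
    if weights is None:
        # Placeholder assumption: uniform weights over the first k_max
        # neighbors. BRDAD would instead learn these weights by minimizing
        # its surrogate risk.
        weights = np.full(k_max, 1.0 / k_max)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (k_max,):
        raise ValueError("weights must have shape (k_max,)")

    rng = np.random.default_rng(seed)
    avg_kdist = np.zeros((n, k_max))
    for _ in range(n_bags):
        # Draw one bag: a subsample of s distinct points.
        idx = rng.choice(n, size=s, replace=False)
        # Distances from every point to every point in the bag: shape (n, s).
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
        # Assumption: a point in the bag is not counted as its own neighbor.
        d[idx, np.arange(s)] = np.inf
        d.sort(axis=1)
        # Column i now holds the (i + 1)-th nearest-neighbor distance.
        avg_kdist += d[:, :k_max]
    avg_kdist /= n_bags
    return avg_kdist @ weights


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 200 points from a standard normal cluster plus one far-away point.
    X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
    scores = bagged_weighted_k_distances(X, n_bags=5, subsample_size=100)
    print("index with the largest score:", int(np.argmax(scores)))
```

In this sketch each bag only requires distances to `subsample_size` points rather than to the full dataset, which is the kind of saving the bagging step is meant to provide. Replacing the placeholder `weights` with data-driven ones is where the neighbor-count sensitivity described above would be addressed.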