We consider the paradigm of unsupervised anomaly detection, which involves identifying anomalies in a dataset in the absence of labeled examples. Although distance-based methods are among the top performers for unsupervised anomaly detection, they are highly sensitive to the choice of the number of nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized $k$-distances for anomaly detection (BRDAD), which converts the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing a surrogate risk, namely a finite-sample bound on the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE). This approach enables us to address the sensitivity of distance-based algorithms to hyperparameter choice. Moreover, the bagging technique incorporated in our BRDAD algorithm mitigates efficiency issues when dealing with large-scale datasets. On the theoretical side, we establish fast convergence rates for the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces computational complexity. On the practical side, numerical experiments on anomaly detection benchmarks illustrate the insensitivity of our algorithm to parameter selection compared with other state-of-the-art distance-based methods. Moreover, applying the bagging technique in our algorithm yields promising improvements on real-world datasets.
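To make the core idea concrete, the following is a minimal illustrative sketch of distance-based anomaly scoring with bagging: each point is scored by its $k$-th nearest-neighbor distance, averaged over random subsamples. This is only a simplified stand-in — the convex weight optimization over the surrogate risk that defines BRDAD, and the BWDDE weighting, are not reproduced here; the function name and parameters are hypothetical.

```python
import numpy as np

def bagged_kdistance_scores(X, k, n_bags=5, subsample=0.5, seed=0):
    """Illustrative sketch (NOT the paper's BRDAD): anomaly score of each
    point is its k-th nearest-neighbor distance to a random subsample,
    averaged over n_bags subsamples. Larger score = more anomalous."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = np.zeros(n)
    size = max(k + 1, int(subsample * n))  # each bag's subsample size
    for _ in range(n_bags):
        idx = rng.choice(n, size=size, replace=False)
        S = X[idx]
        # distances from every point to every point in the subsample
        d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)
        d.sort(axis=1)
        # k-th smallest distance to the subsample (for points inside the
        # subsample this includes their zero self-distance; acceptable
        # for a sketch)
        scores += d[:, k]
    return scores / n_bags
```

A point far from the bulk of the data has a large $k$-distance in every subsample, so its averaged score stands out; averaging over bags reduces both the variance of the score and the per-bag neighbor-search cost, which is the efficiency motivation behind bagging in the abstract.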