Outlier detection (OD) aims to identify abnormal instances, known as outliers or anomalies, by learning the typical patterns of normal data, or inliers. Performing OD in an unsupervised regime, without any information about anomalous instances in the training data, is challenging. A recently observed phenomenon, known as the inlier-memorization (IM) effect, in which deep generative models (DGMs) tend to memorize inlier patterns during early training, provides a promising signal for distinguishing outliers. However, existing unsupervised approaches that rely solely on the IM effect still struggle when inliers and outliers are not well separated or when outliers form dense clusters. To address these limitations, we incorporate active learning to selectively acquire informative labels, and propose IMBoost, a novel framework that explicitly reinforces the IM effect to improve outlier detection. Our method consists of two stages: 1) a warm-up phase that induces and promotes the IM effect, and 2) a polarization phase in which actively queried samples are used to maximize the discrepancy between inlier and outlier scores. In particular, we propose a novel query strategy and a tailored loss function for the polarization phase to effectively identify informative samples and fully exploit the limited labeling budget. We provide a theoretical analysis showing that IMBoost consistently decreases the inlier risk while increasing the outlier risk throughout training, thereby amplifying their separation. Extensive experiments on diverse benchmark datasets demonstrate that IMBoost not only significantly outperforms state-of-the-art active OD methods but also incurs substantially lower computational cost.
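The core idea of the polarization phase, driving queried inlier scores down while driving queried outlier scores up, can be illustrated with a minimal toy sketch. Everything below is an illustrative assumption, not IMBoost's actual architecture or loss: a one-parameter "model" whose per-sample score is a squared error, trained on a hinge-style objective that lowers scores of labeled inliers and raises scores of labeled outliers (bounded by a margin `m`).

```python
import numpy as np

# Toy sketch of a polarization-style objective (illustrative assumption only):
# minimize scores of queried inliers while pushing queried outlier scores
# above a margin m. score(x) here stands in for a DGM's per-sample loss.

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 0.5, size=50)   # queried samples labeled normal
outliers = rng.normal(4.0, 0.5, size=5)   # queried samples labeled anomalous

def score(x, w):
    return (x - w) ** 2  # per-sample outlier score of the toy model

w, lr, m = 2.0, 0.05, 9.0
for _ in range(200):
    # objective: mean(score(inliers)) + mean(max(0, m - score(outliers)))
    active = score(outliers, w) < m       # outliers still below the margin
    grad = (-2.0 * (inliers - w)).mean() + (2.0 * (outliers - w) * active).mean()
    w -= lr * grad

# after training, inlier and outlier scores are polarized
print(score(inliers, w).mean() < score(outliers, w).mean())  # True
```

The hinge on the outlier term keeps the objective bounded: once an outlier's score clears the margin, it stops contributing to the gradient, so the model is not pushed to inflate scores indefinitely.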