Detecting test data that deviate from the training data is a central problem for safe and robust machine learning. Likelihoods learned by a generative model, e.g., a normalizing flow trained with the standard log-likelihood objective, perform poorly as an outlier score. We propose to use an unlabelled auxiliary dataset and a probabilistic outlier score for outlier detection. We use a self-supervised feature extractor trained on the auxiliary dataset and train a normalizing flow on the extracted features by maximizing the likelihood on in-distribution data while minimizing the likelihood on the contrastive dataset. We show that this is equivalent to learning the normalized positive difference between the in-distribution and the contrastive feature densities. We conduct experiments on benchmark datasets and compare our method to the likelihood, the likelihood ratio, and state-of-the-art anomaly detection methods.
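As a rough illustration of the two ideas stated above, the following sketch shows (a) a loss that raises the log-likelihood on in-distribution samples and lowers it on contrastive samples, and (b) a normalized positive difference of two densities evaluated on a 1D grid. Everything here is an assumption made for illustration: the function names, the equal weighting of the two loss terms, the reading of "positive difference" as max(p_in - p_con, 0), and the toy Gaussian densities. It does not use a normalizing flow or a feature extractor and is not the paper's implementation.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian, used here only as a toy stand-in."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def contrastive_objective(log_q_in, log_q_con):
    """Loss to minimize (hypothetical, equally weighted form).

    Minimizing it maximizes the mean log-likelihood on in-distribution
    samples and minimizes it on contrastive samples.
    """
    return -np.mean(log_q_in) + np.mean(log_q_con)


def normalized_positive_difference(p_in, p_con, dx):
    """Clip the density difference at zero and renormalize on the grid.

    Assumes the positive part is max(p_in - p_con, 0) and that it is
    nonzero somewhere on the grid.
    """
    diff = np.maximum(p_in - p_con, 0.0)
    return diff / (diff.sum() * dx)


# Toy 1D "feature" densities (assumed, not from the paper).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p_in = gaussian_pdf(x, mean=0.0, std=1.0)   # in-distribution density
p_con = gaussian_pdf(x, mean=2.0, std=2.0)  # contrastive density

target = normalized_positive_difference(p_in, p_con, dx)
```

In this toy setting, `target` is zero wherever the contrastive density exceeds the in-distribution density, so regions better explained by the contrastive data receive no probability mass under the target density.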