Contrastive Language-Image Pre-training (CLIP) achieves remarkable performance on a variety of downstream tasks by aligning image and text input embeddings, and it holds great promise for anomaly detection. However, our empirical experiments show that the embeddings of text inputs unexpectedly cluster tightly together, far away from the image embeddings, contrary to the model's contrastive training objective of aligning image-text input pairs. We show that this phenomenon induces a `similarity bias', in which false negative and false positive errors occur because the similarities between images and the normal-label text embeddings are systematically skewed. To address this bias, we propose a novel methodology, called BLISS, which directly accounts for the similarity bias through the use of an auxiliary, external set of text inputs. BLISS is simple: it requires neither strong inductive biases about anomalous behaviour nor an expensive training process, and it significantly outperforms baseline methods on benchmark image datasets, even when access to normal data is extremely limited.
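To make the bias-correction idea concrete, the following is a minimal sketch, not the paper's BLISS implementation: it assumes unit-normalised CLIP-style embeddings and corrects the image-to-normal-text similarity by subtracting the image's mean similarity to an auxiliary set of external text embeddings. The names (`score_debiased`, `aux_text_embs`) and the exact scoring rule are our illustrative assumptions.

```python
# Hedged sketch of a bias-corrected similarity score for anomaly detection.
# ASSUMPTION: the correction subtracts the image's baseline similarity to
# arbitrary auxiliary texts; the paper's actual BLISS rule may differ.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector a and each row of b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

def score_debiased(img_emb: np.ndarray,
                   normal_text_emb: np.ndarray,
                   aux_text_embs: np.ndarray) -> float:
    """Anomaly score: low similarity to the `normal' label text, after
    removing the image's baseline similarity to auxiliary text inputs
    (the `similarity bias'). Higher score => more anomalous."""
    raw = float(cosine(img_emb, normal_text_emb[None])[0])
    baseline = float(cosine(img_emb, aux_text_embs).mean())
    return -(raw - baseline)

# Toy usage with random stand-ins for CLIP embeddings (dimension 512).
rng = np.random.default_rng(0)
img = rng.normal(size=512)              # image embedding
normal_txt = rng.normal(size=512)       # "normal" label text embedding
aux = rng.normal(size=(20, 512))        # auxiliary, external text embeddings
print(score_debiased(img, normal_txt, aux))
```

Because text embeddings cluster far from image embeddings, the raw image-text similarity carries a modality-dependent offset shared across labels; subtracting a per-image baseline estimated from auxiliary texts is one simple way such an offset could be cancelled without any training.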