Image anomaly detection has long been a challenging task in computer vision. The advent of vision-language models, particularly CLIP-based frameworks, has opened new avenues for zero-shot anomaly detection. Recent studies have applied CLIP by aligning images with normal and abnormal prompt descriptions. However, relying exclusively on textual guidance often falls short, which highlights the importance of additional visual references. In this work, we introduce a Dual-Image Enhanced CLIP approach that leverages a joint vision-language scoring system. Our method processes pairs of images, using each as a visual reference for the other, thereby enriching inference with visual context. This dual-image strategy markedly improves both anomaly classification and localization performance. Furthermore, we strengthen our model with a test-time adaptation module that incorporates synthesized anomalies to refine its localization capability. Our approach exploits the potential of joint vision-language anomaly detection and achieves performance comparable to current state-of-the-art (SOTA) methods across various datasets.
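The dual-image idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring formulas, so everything below is an assumption rather than the authors' implementation: the function names, the softmax temperature, the nearest-patch visual score, and the mixing weight `alpha` are hypothetical. Patch and text embeddings are assumed to come from a CLIP-style encoder and are passed in as plain NumPy arrays.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_scores(patches, text_normal, text_abnormal, temperature=0.07):
    """Language branch: per-patch probability assigned to the 'abnormal' prompt.

    patches: (N, D) patch embeddings; text_*: (D,) prompt embeddings.
    The temperature value is an assumed hyperparameter.
    """
    p = l2_normalize(patches)
    t = l2_normalize(np.stack([text_normal, text_abnormal]))  # (2, D)
    logits = p @ t.T / temperature                            # (N, 2)
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 1]


def visual_scores(patches, reference_patches):
    """Vision branch: dissimilarity to the closest patch of the reference image."""
    p = l2_normalize(patches)
    r = l2_normalize(reference_patches)
    sim = p @ r.T                            # (N, M) cosine similarities
    return 0.5 * (1.0 - sim.max(axis=1))     # roughly in [0, 1]


def dual_image_scores(patches_a, patches_b, text_normal, text_abnormal, alpha=0.5):
    """Score two images jointly, each serving as the visual reference for the other.

    `alpha` (assumed) weights the language branch against the vision branch.
    Returns one per-patch anomaly map per image.
    """
    maps = []
    for query, ref in ((patches_a, patches_b), (patches_b, patches_a)):
        s_text = text_scores(query, text_normal, text_abnormal)
        s_vis = visual_scores(query, ref)
        maps.append(alpha * s_text + (1.0 - alpha) * s_vis)
    return maps[0], maps[1]


# Toy usage with random stand-ins for CLIP embeddings (not real features).
rng = np.random.default_rng(0)
patches_a = rng.normal(size=(16, 32))
patches_b = rng.normal(size=(16, 32))
text_normal = rng.normal(size=32)
text_abnormal = rng.normal(size=32)

map_a, map_b = dual_image_scores(patches_a, patches_b, text_normal, text_abnormal)
image_score_a = map_a.max()  # one simple (assumed) way to get an image-level score
```

In this sketch, localization would come from reshaping each per-patch map back to the patch grid, and classification from an aggregate such as the maximum. The test-time adaptation module with synthesized anomalies mentioned in the abstract is not modeled here.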