Contrastive Language-Image Pre-training (CLIP) models have shown promising performance on zero-shot visual recognition tasks by learning visual representations under natural language supervision. Recent studies attempt to use CLIP for zero-shot anomaly detection by matching images against prompts that describe normal and abnormal states. However, because CLIP focuses on building correspondence between paired text prompts and global image-level representations, its lack of patch-level vision-to-text alignment limits its capability for precise visual anomaly localization. In this work, we introduce a training-free adaptation (TFA) framework of CLIP for zero-shot anomaly localization. In the visual encoder, we propose a training-free value-wise attention mechanism that extracts CLIP's intrinsic local tokens for patch-level local description. On the text supervision side, we design a unified domain-aware contrastive state prompting template. On top of the proposed TFA, we further introduce a test-time adaptation (TTA) mechanism to refine anomaly localization results, in which a layer of trainable parameters in the adapter is optimized using TFA's pseudo-labels and synthetic noise-corrupted tokens. With both TFA and TTA, we exploit the potential of CLIP for zero-shot anomaly localization more fully and demonstrate the effectiveness of the proposed methods on various datasets.
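The general idea of matching patch-level tokens against normal and abnormal state prompts can be illustrated with a minimal sketch. This is not the TFA framework itself: it does not implement value-wise attention or the prompting template, and it uses toy 2-D vectors in place of real CLIP embeddings. The function name `patch_anomaly_map`, the toy values, and the temperature of 0.07 are assumptions made for illustration only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def patch_anomaly_map(patch_tokens, normal_text, abnormal_text, temperature=0.07):
    """Return one anomaly probability per patch token.

    Each patch token is compared with a "normal" and an "abnormal" text
    embedding; a two-way softmax over the scaled similarities gives the
    probability assigned to the abnormal state.
    """
    scores = []
    for token in patch_tokens:
        s_normal = cosine(token, normal_text) / temperature
        s_abnormal = cosine(token, abnormal_text) / temperature
        # Subtract the max before exponentiating for numerical stability.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores

# Toy 2-D embeddings (hypothetical values, not CLIP outputs).
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patch_tokens = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

anomaly_map = patch_anomaly_map(patch_tokens, normal_text, abnormal_text)
```

In this toy setup, the first patch lies close to the normal prompt and receives a low anomaly probability, the second lies close to the abnormal prompt and receives a high one, and the third is equidistant from both. In practice, the per-patch scores would be reshaped to the patch grid and upsampled to image resolution to form a localization map.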