Recently, foundation models such as CLIP and SAM have shown promising performance on Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from non-negligible drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we propose ClipSAM, a CLIP and SAM collaboration framework for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, and to use the result as prompt constraints for SAM to refine the anomaly segmentation. In detail, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that lets language features interact with CLIP's visual features at multiple scales to reason about anomaly positions. We then design a Multi-level Mask Refinement (MMR) module, which uses the positional information as multi-level prompts for SAM to acquire hierarchical levels of masks and merges them. Extensive experiments validate the effectiveness of our approach, which achieves the best segmentation performance on the MVTec-AD and VisA datasets.
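
To make the two-stage idea concrete, the following is a minimal sketch of how a rough anomaly map could be turned into prompts and how several candidate masks could be merged. It is not the ClipSAM implementation: the abstract does not specify how prompts are derived or how masks are merged, so the thresholding rule, the choice of one box plus top-scoring points, the weighted-average merge, and the names `prompts_from_rough_map` and `merge_masks` are all assumptions made for illustration. The calls to CLIP and SAM themselves are deliberately left out.

```python
import numpy as np


def prompts_from_rough_map(score_map, threshold=0.5, num_points=3):
    """Derive hypothetical box and point prompts from a rough anomaly map.

    score_map: 2D array of per-pixel anomaly scores in [0, 1], standing in
    for the rough segmentation produced by the CLIP stage (assumption).
    Returns (box, points): box is (x0, y0, x1, y1) or None if nothing
    exceeds the threshold; points is an (N, 2) integer array of (x, y).
    """
    mask = score_map >= threshold
    if not mask.any():
        # No pixel looks anomalous: no prompts to hand to the second stage.
        return None, np.empty((0, 2), dtype=int)

    # Tight bounding box around every above-threshold pixel.
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # The highest-scoring pixels serve as point prompts.
    top = np.argsort(score_map, axis=None)[::-1][:num_points]
    rows, cols = np.unravel_index(top, score_map.shape)
    points = np.stack([cols, rows], axis=1)  # (x, y) order
    return box, points


def merge_masks(masks, weights=None):
    """Merge candidate binary masks by weighted average, then binarize.

    This averaging rule is an illustrative stand-in, not the merging
    strategy used by the MMR module.
    """
    stacked = np.stack([np.asarray(m, dtype=float) for m in masks])
    averaged = np.average(stacked, axis=0, weights=weights)
    return averaged >= 0.5
```

In such a setup, each prompt (the box alone, the points alone, or both together) would be passed to a promptable segmenter to obtain one candidate mask per prompt level, and `merge_masks` would combine the candidates, optionally weighted by a per-mask confidence score.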