Recently, foundation models such as CLIP and SAM have shown promising performance on Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from non-negligible drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we propose ClipSAM, a CLIP and SAM collaboration framework for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, whose output then serves as prompt constraints for SAM to refine the anomaly segmentation results. In detail, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that lets language features interact with visual features at multiple scales of CLIP to reason about anomaly positions. We then design a Multi-level Mask Refinement (MMR) module, which uses this positional information as multi-level prompts for SAM to acquire hierarchical levels of masks and merges them. Extensive experiments validate the effectiveness of our approach, which achieves the best segmentation performance on the MVTec-AD and VisA datasets.
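The coarse-to-fine idea described in the abstract (a rough anomaly map supplies prompts, and several candidate masks are then merged) can be illustrated with a minimal NumPy sketch. This is not the authors' UMCI or MMR module, whose details the abstract does not give. The function names `prompts_from_coarse_map` and `merge_masks`, the 0.5 thresholds, and the score-weighted merge are all assumptions made for illustration, and the call to SAM itself is left out: the candidate masks are taken as given.

```python
import numpy as np

def prompts_from_coarse_map(prob, threshold=0.5):
    """Turn a coarse anomaly probability map into a box and a point prompt.

    Hypothetical helper: returns (box, point), where box is
    (x_min, y_min, x_max, y_max) and point is (x, y), or (None, None)
    when no pixel exceeds the threshold.
    """
    mask = prob > threshold
    if not mask.any():
        return None, None
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    # Use the highest-scoring pixel as the point prompt.
    y, x = np.unravel_index(np.argmax(prob), prob.shape)
    return box, (int(x), int(y))

def merge_masks(masks, scores):
    """Merge candidate binary masks (assumed merge rule, not the paper's).

    Takes a score-weighted average of the masks and keeps pixels whose
    fused value is at least 0.5.
    """
    masks = np.asarray(masks, dtype=float)
    weights = np.asarray(scores, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, masks, axes=1)
    return fused >= 0.5
```

In a full pipeline, the box and point would be passed to a promptable segmenter such as SAM, and the masks it returns for the different prompts would be the inputs to the merge step.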