Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior work compensates with complex auxiliary modules yet largely overlooks the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection, learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level performance by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.