Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

Few-shot anomaly detection (FSAD) methods identify anomalous regions using only a few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, because detailed textual descriptions are lacking, these methods can only match pre-defined image-level descriptions against each visual patch token to identify potentially anomalous regions. This causes semantic misalignment between image-level descriptions and patch-level visual anomalies and yields sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance. It consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA uses a region aggregation strategy and multi-level alignment training to help the learnable prompts align better with their corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.
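To make the baseline that the abstract criticizes more concrete, the following is a minimal sketch of patch-level scoring with image-level prompts: each patch token is compared against one "normal" and one "abnormal" text embedding, and a softmax over the two cosine similarities gives a per-patch anomaly score. This is not the FineGrainedAD method. The function names, the toy 2-D embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration; in practice the embeddings would come from a VLM's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def patch_anomaly_scores(patch_tokens, normal_text, abnormal_text, temperature=0.07):
    """Return one anomaly probability per patch token.

    patch_tokens: list of patch embeddings (hypothetical VLM outputs).
    normal_text / abnormal_text: embeddings of two image-level prompts.
    temperature: assumed softmax temperature, not taken from the paper.
    """
    scores = []
    for token in patch_tokens:
        s_normal = cosine(token, normal_text) / temperature
        s_abnormal = cosine(token, abnormal_text) / temperature
        # Subtract the max logit for a numerically stable two-way softmax.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores


# Toy embeddings: the two prompts are orthogonal unit vectors.
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patch_tokens = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]

scores = patch_anomaly_scores(patch_tokens, normal_text, abnormal_text)
print([round(s, 3) for s in scores])
```

Because the same two image-level prompts are reused for every patch, a patch that is equally similar to both (the third one here) receives an uninformative score. That is the kind of image-to-patch mismatch that fine-grained, region-specific descriptions are meant to reduce.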

