Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting, which requires a large number of normal samples to train a model, ZSAD is more practical for data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, reasoning about image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs such as GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. Project page: https://xujiacong.github.io/Anomaly-OV/
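To make the Look-Twice idea concrete, below is a minimal, hypothetical sketch of how abnormal visual tokens could be selected and emphasized via feature matching before being passed to the language model. The abstract does not specify the actual LTFM architecture; the function name, query embeddings, and temperature here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the actual LTFM design in Anomaly-OV is not described in
# this abstract. All names (ltfm_reweight, normal/abnormal queries, temperature) are
# hypothetical placeholders for the general idea of anomaly-aware token re-weighting.
import torch
import torch.nn.functional as F


def ltfm_reweight(visual_tokens: torch.Tensor,
                  normal_query: torch.Tensor,
                  abnormal_query: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Re-weight visual tokens by how much closer each one is to an 'abnormal'
    query than to a 'normal' query, i.e., a second, focused look at suspicious regions.

    visual_tokens: (N, D) patch/token embeddings from the vision encoder
    normal_query, abnormal_query: (D,) learned query embeddings
    Returns: (N, D) tokens scaled by a per-token anomaly weight.
    """
    tokens = F.normalize(visual_tokens, dim=-1)
    queries = F.normalize(torch.stack([normal_query, abnormal_query]), dim=-1)  # (2, D)
    logits = tokens @ queries.T / temperature                                    # (N, 2)
    anomaly_weight = logits.softmax(dim=-1)[:, 1:]                               # (N, 1)
    return visual_tokens * (1.0 + anomaly_weight)  # emphasize suspicious tokens


if __name__ == "__main__":
    feats = torch.randn(196, 768)                  # e.g. 14x14 patch tokens
    q_norm, q_abn = torch.randn(768), torch.randn(768)
    out = ltfm_reweight(feats, q_norm, q_abn)
    print(out.shape)                               # torch.Size([196, 768])
```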