Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting, which requires a large number of normal samples to train a model, ZSAD is more practical for data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, reasoning about image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and an evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs such as GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. Project page: https://xujiacong.github.io/Anomaly-OV/