Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting, which requires a large number of normal samples to train a model, ZSAD is more practical for data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, reasoning about image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs such as GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. Project page: https://xujiacong.github.io/Anomaly-OV/
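To make the Look-Twice idea concrete, below is a minimal, hypothetical sketch of how abnormal visual tokens could be selected and emphasized via feature matching before being passed to the language model. The abstract does not specify the actual LTFM architecture; the function name, query embeddings, and temperature here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the actual LTFM design in Anomaly-OV is not described in
# this abstract. All names (ltfm_reweight, normal/abnormal queries, temperature) are
# hypothetical placeholders for the general idea of anomaly-aware token re-weighting.
import torch
import torch.nn.functional as F


def ltfm_reweight(visual_tokens: torch.Tensor,
                  normal_query: torch.Tensor,
                  abnormal_query: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Re-weight visual tokens by how much closer each one is to an 'abnormal'
    query than to a 'normal' query, i.e., a second, focused look at suspicious regions.

    visual_tokens: (N, D) patch/token embeddings from the vision encoder
    normal_query, abnormal_query: (D,) learned query embeddings
    Returns: (N, D) tokens scaled by a per-token anomaly weight.
    """
    tokens = F.normalize(visual_tokens, dim=-1)
    queries = F.normalize(torch.stack([normal_query, abnormal_query]), dim=-1)  # (2, D)
    logits = tokens @ queries.T / temperature                                    # (N, 2)
    anomaly_weight = logits.softmax(dim=-1)[:, 1:]                               # (N, 1)
    return visual_tokens * (1.0 + anomaly_weight)  # emphasize suspicious tokens


if __name__ == "__main__":
    feats = torch.randn(196, 768)                  # e.g. 14x14 patch tokens
    q_norm, q_abn = torch.randn(768), torch.randn(768)
    out = ltfm_reweight(feats, q_norm, q_abn)
    print(out.shape)                               # torch.Size([196, 768])
```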