Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have demonstrated strong image understanding and achieved remarkable performance on various visual tasks. Although extensive training data lets them recognize common objects well, they lack domain-specific knowledge and have a weaker understanding of localized details within objects, which hinders their effectiveness in the Industrial Anomaly Detection (IAD) task. On the other hand, most existing IAD methods only provide anomaly scores and require manually set thresholds to distinguish between normal and abnormal samples, which restricts their practical deployment. In this paper, we explore the use of LVLMs to address the IAD problem and propose AnomalyGPT, a novel IAD approach based on an LVLM. We generate training data by simulating anomalous images and producing a corresponding textual description for each image. We also employ an image decoder to provide fine-grained semantics and design a prompt learner to fine-tune the LVLM using prompt embeddings. AnomalyGPT eliminates the need for manual threshold adjustment and thus directly assesses the presence and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues and exhibits impressive few-shot in-context learning capabilities. With only one normal shot, AnomalyGPT achieves state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset. Code is available at https://github.com/CASIA-IVA-Lab/AnomalyGPT.
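To make the thresholding issue noted in the abstract concrete, the following is a minimal sketch (not the AnomalyGPT implementation) of why score-only IAD methods depend on a manually chosen threshold. The anomaly scores and labels are hypothetical toy values, and the helper names `accuracy_at_threshold` and `roc_auc` are our own. Image-level AUC is threshold-free because it only ranks samples, whereas accuracy requires committing to a cutoff.

```python
from typing import Sequence


def accuracy_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """Accuracy when a sample is called anomalous (1) if score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random anomalous sample outscores a
    random normal one (ties count as half). No threshold is involved."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 0 = normal, 1 = anomalous (not from the paper).
scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.80]
labels = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print("AUC:", roc_auc(scores, labels))
    for t in (0.15, 0.38, 0.50):
        print(f"accuracy at threshold {t}:", accuracy_at_threshold(scores, labels, t))
```

In this toy case the scores separate the two classes perfectly, yet the reported accuracy still varies with the chosen cutoff. This is the practical burden that a model which directly answers whether an anomaly is present aims to remove.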