The Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) task and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Because this task requires both image-level and pixel-level evaluation, the proposed GPT-4V-AD framework contains three components: \textbf{\textit{1)}} Granular Region Division, \textbf{\textit{2)}} Prompt Designing, and \textbf{\textit{3)}} Text2Segmentation for easy quantitative evaluation. We also try several variants for comparative analysis. The results show that GPT-4V can perform zero-shot AD through a VQA paradigm, achieving image-level AU-ROCs of 77.1/88.0 and pixel-level AU-ROCs of 68.0/76.6 on the MVTec AD and VisA datasets, respectively. However, its performance still lags behind state-of-the-art zero-shot methods, \eg, WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for research on VQA-oriented LMMs in the zero-shot AD task, and we also outline several possible directions for future work. Code is available at \url{https://github.com/zhangzjn/GPT-4V-AD}.
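To make the Text2Segmentation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that region division yields a per-pixel map of integer region IDs and that the model replies in free text naming the anomalous IDs. The function names `text_to_mask` and `auroc`, the toy 4x4 region map, and the reply string are all hypothetical.

```python
import re

def text_to_mask(answer, region_map):
    """Turn a textual answer listing region IDs into a binary pixel mask."""
    valid_ids = {rid for row in region_map for rid in row}
    # Keep only integers in the answer that name an existing region.
    # Assumption: the reply contains no other numbers that collide with IDs.
    flagged = {int(tok) for tok in re.findall(r"\d+", answer)} & valid_ids
    return [[1 if rid in flagged else 0 for rid in row] for row in region_map]

def auroc(scores, labels):
    """AU-ROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical 4x4 image divided into four 2x2 regions labelled 1..4.
region_map = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
answer = "Anomalous regions: 2, 3"  # hypothetical model reply
mask = text_to_mask(answer, region_map)

# Hypothetical ground truth: only region 2 is truly anomalous.
ground_truth = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
flat_pred = [v for row in mask for v in row]
flat_gt = [v for row in ground_truth for v in row]
pixel_auroc = auroc(flat_pred, flat_gt)
image_score = max(flat_pred)  # image is flagged if any region is flagged
```

Because the mask is binary per region, the pixel-level score in such a scheme is bounded by how finely the image is divided, which is one reason the granularity of the region division matters.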