Existing industrial anomaly detection (IAD) methods predict anomaly scores for both anomaly detection and localization. However, they struggle to support multi-turn dialogue or to give detailed descriptions of anomaly regions, e.g., the color, shape, and category of industrial anomalies. Recently, large multimodal (i.e., vision and language) models (LMMs) have shown strong perception abilities on multiple vision tasks such as image captioning, visual understanding, and visual reasoning, making them a promising choice for more comprehensible anomaly detection. However, existing general-purpose LMMs lack knowledge of anomaly detection, while training a dedicated LMM for anomaly detection requires a tremendous amount of annotated data and massive computational resources. In this paper, we propose a novel large multimodal model that applies vision experts to industrial anomaly detection (dubbed Myriad), which yields definite anomaly detection and high-quality anomaly descriptions. Specifically, we adopt MiniGPT-4 as the base LMM and design an Expert Perception module to embed the prior knowledge from vision experts as tokens that are intelligible to Large Language Models (LLMs). To compensate for the errors and confusions of vision experts, we introduce a domain adapter to bridge the visual representation gap between generic and industrial images. Furthermore, we propose a Vision Expert Instructor, which enables the Q-Former to generate IAD-domain vision-language tokens according to the vision expert prior. Extensive experiments on the MVTec-AD and VisA benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods under the 1-class and few-shot settings, but also provides definite anomaly predictions along with detailed descriptions in the IAD domain.
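As a rough illustration of how a single anomaly score map can serve both detection and localization, consider the minimal sketch below. It is not Myriad's implementation: the function names, the max-pooling rule for the image-level score, the toy 3x3 map, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import List

AnomalyMap = List[List[float]]


def image_score(anomaly_map: AnomalyMap) -> float:
    """Image-level detection score: here, the maximum pixel score (an assumed, common choice)."""
    return max(max(row) for row in anomaly_map)


def localize(anomaly_map: AnomalyMap, threshold: float) -> List[List[int]]:
    """Pixel-level localization: mark pixels whose score reaches the threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in anomaly_map]


# Hypothetical 3x3 anomaly map, as a conventional IAD model might output it.
anomaly_map = [
    [0.05, 0.10, 0.08],
    [0.07, 0.92, 0.85],
    [0.04, 0.11, 0.09],
]
THRESHOLD = 0.5  # assumed decision threshold

score = image_score(anomaly_map)
mask = localize(anomaly_map, THRESHOLD)
is_anomalous = score >= THRESHOLD

print(score, is_anomalous)
print(mask)
```

The output here is only a number and a binary mask. The user still has to choose a threshold to obtain a decision, and nothing describes the defect itself, which is the gap that the definite predictions and detailed descriptions mentioned in the abstract are meant to address.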