We propose a general visual inspection model based on a Vision-Language Model~(VLM) that uses few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. Thus, we construct a dataset of diverse non-defective and defective product images collected from the web, paired with output texts in a unified format, and fine-tune a VLM on it. For new products, our method employs In-Context Learning, which allows the model to perform inspections given a single example of a non-defective or defective image and the corresponding explanatory text with visual prompts. This approach eliminates the need to collect a large number of training samples and re-train the model for each product. Experimental results show that our method achieves high performance, with an MCC of 0.804 and an F1-score of 0.950 on MVTec AD in a one-shot manner. Our code is available at~https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.