Recent advances in multimodal foundation models have set new standards in few-shot anomaly detection. This paper explores whether high-quality visual features alone are sufficient to rival existing state-of-the-art vision-language models. We affirm this by adapting DINOv2 for one-shot and few-shot anomaly detection, with a focus on industrial applications. We show that this approach not only rivals existing techniques but can even outmatch them in many settings. Our proposed vision-only approach, AnomalyDINO, is based on patch similarities and enables both image-level anomaly prediction and pixel-level anomaly segmentation. The approach is methodologically simple and training-free and, thus, does not require any additional data for fine-tuning or meta-learning. Despite its simplicity, AnomalyDINO achieves state-of-the-art results in one- and few-shot anomaly detection (e.g., pushing the one-shot performance on MVTec-AD from an AUROC of 93.1% to 96.6%). The reduced overhead, coupled with its outstanding few-shot performance, makes AnomalyDINO a strong candidate for fast deployment, for example, in industrial contexts.
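The patch-similarity idea behind such a training-free detector can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes patch features (e.g., from DINOv2) are already extracted as row vectors, and it scores each test patch by its cosine distance to the nearest nominal reference patch, aggregating the top patch scores into an image-level score. The function names and the top-k aggregation choice are illustrative assumptions.

```python
import numpy as np

def patch_anomaly_scores(ref_patches, test_patches):
    """Per-patch anomaly scores via nearest-neighbor cosine distance.

    ref_patches:  (N_ref, D) patch features from nominal reference images
    test_patches: (N_test, D) patch features from the test image
    """
    # L2-normalize so dot products equal cosine similarities
    ref = ref_patches / np.linalg.norm(ref_patches, axis=1, keepdims=True)
    test = test_patches / np.linalg.norm(test_patches, axis=1, keepdims=True)
    sim = test @ ref.T  # (N_test, N_ref) cosine similarities
    # A patch is anomalous if even its closest nominal patch is dissimilar
    return 1.0 - sim.max(axis=1)

def image_score(patch_scores, top_k=10):
    """Aggregate patch scores into one image-level score (mean of top-k)."""
    k = min(top_k, len(patch_scores))
    return float(np.sort(patch_scores)[-k:].mean())

# Example with random stand-in features (a real pipeline would use DINOv2)
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 32))   # memory bank from few-shot references
test_img = rng.normal(size=(50, 32))     # patches of a test image
scores = patch_anomaly_scores(reference, test_img)
print(image_score(scores))
```

Because the per-patch scores live on the patch grid, the same quantities can be reshaped and upsampled to obtain a pixel-level anomaly map, which is what enables segmentation without any training.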