Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defects), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Specifically, the CLIP-based discriminative model excels at capturing global features, while the diffusion-based generative model effectively captures local details, creating a synergistic and complementary approach. Notably, we introduce a methodology for utilizing cross-attention maps and feature maps extracted from diffusion models specifically for anomaly detection. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods, achieving strong performance in both anomaly segmentation and classification. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.