Recent years have seen a surge of interest in anomaly detection for tasks such as industrial defect detection and event detection. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges from redundant information and a sparse latent space. By contrast, detectors for the language modality perform well because their data are relatively uniform. This paper tackles these challenges for the vision modality from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), which consists of Cross-modal Entropy Reduction (CMER) and Cross-modal Linear Embedding (CMLE), to address the redundant-information issue and the sparse-space issue, respectively. CMER masks parts of the raw image and computes the matching score between the masked image and the text. It then discards irrelevant pixels so that the detector focuses on critical content. To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality and uses this matrix to guide the learning of the vision latent space. As a result, semantically similar images are pulled closer together in the vision latent space. Extensive experiments demonstrate the effectiveness of the proposed methods. In particular, CMG outperforms the baseline that uses only images by 16.81%. Ablation experiments further confirm the synergy among the proposed methods, as each component depends on the other to achieve optimal performance.
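The two components described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the matching model, the mask shape, or the form of the correlation structure matrix. The sketch assumes square patch masks, a caller-supplied `match_score` function (here the toy stand-in `toy_score`), a cosine-similarity matrix over text embeddings, and a mean-squared alignment loss. All function names and parameters (`cmer_mask`, `correlation_structure`, `cmle_loss`, `patch`, `keep_ratio`) are hypothetical.

```python
import numpy as np


def cmer_mask(image, text, match_score, patch=4, keep_ratio=0.5):
    """CMER-style sketch: occlude each patch, measure how much the
    image-text matching score drops, and keep only the patches whose
    removal hurts the score most. Other pixels are zeroed out."""
    h, w = image.shape[:2]
    base = match_score(image, text)
    drops = {}
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = 0
            # A large drop means the patch mattered for matching the text.
            drops[(top, left)] = base - match_score(occluded, text)
    n_keep = max(1, int(round(keep_ratio * len(drops))))
    kept = sorted(drops, key=drops.get, reverse=True)[:n_keep]
    out = np.zeros_like(image)
    for top, left in kept:
        out[top:top + patch, left:left + patch] = \
            image[top:top + patch, left:left + patch]
    return out


def correlation_structure(text_emb):
    """Assumed form of the structure matrix: pairwise cosine similarity
    between the text embeddings of the samples in a batch."""
    unit = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return unit @ unit.T


def cmle_loss(vision_latent, structure):
    """Assumed guidance term: mean squared gap between the vision
    latents' cosine-similarity matrix and the text-derived matrix."""
    unit = vision_latent / np.linalg.norm(vision_latent, axis=1, keepdims=True)
    return float(np.mean((unit @ unit.T - structure) ** 2))


def toy_score(image, text):
    """Toy stand-in for an image-text matching model: it pretends the
    text describes whatever sits in the top-left 4x4 corner."""
    return float(image[:4, :4].sum())


if __name__ == "__main__":
    img = np.ones((8, 8))
    reduced = cmer_mask(img, "a scratch in the corner", toy_score,
                        patch=4, keep_ratio=0.25)
    print(reduced)  # only the top-left patch survives

    text_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    S = correlation_structure(text_emb)
    aligned = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
    misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    print(cmle_loss(aligned, S), cmle_loss(misaligned, S))
```

In a real pipeline, `match_score` would be replaced by a pretrained image-text matching model, and a term like `cmle_loss` would be added to the vision detector's training objective so that images with similar text descriptions end up with similar latents.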