AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis

Anomaly synthesis is a crucial approach for augmenting abnormal data to advance anomaly inspection. Drawing on knowledge from large-scale pre-training, existing text-to-image anomaly synthesis methods predominantly rely on textual information or coarsely aligned visual features to guide the entire generation process. However, these methods often lack sufficient descriptors to capture the complicated characteristics of realistic anomalies (e.g., the fine-grained visual pattern of anomalies), which limits the realism and generalization of the generation process. To this end, we propose AnomalyControl, a novel anomaly synthesis framework that learns cross-modal semantic features as guidance signals. These features encode generalized anomaly cues from text-image reference prompts and improve the realism of synthesized abnormal samples. Specifically, AnomalyControl takes a flexible, non-matching prompt pair (i.e., a text-image reference prompt and a targeted text prompt), and a Cross-modal Semantic Modeling (CSM) module extracts cross-modal semantic features from the textual and visual descriptors. An Anomaly-Semantic Enhanced Attention (ASEA) mechanism then lets CSM focus on the specific visual patterns of the anomaly, enhancing the realism and contextual relevance of the generated anomaly features. Treating the cross-modal semantic features as a prior, a Semantic Guided Adapter (SGA) encodes effective guidance signals for an adequate and controllable synthesis process. Extensive experiments indicate that AnomalyControl achieves state-of-the-art results in anomaly synthesis compared with existing methods and exhibits superior performance on downstream tasks.
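
To make the data flow described above more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that cross-modal fusion can be approximated by cross-attention from text tokens to image patches, that the attention enhancement can be approximated by an additive bias on patches inside an anomaly mask, and that the adapter is a small residual bottleneck. The names `mask_biased_cross_attention`, `residual_adapter`, and `bias`, along with all shapes and the random inputs, are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mask_biased_cross_attention(text_tokens, image_patches, anomaly_mask, bias=2.0):
    """Text tokens (queries) attend over image patches (keys and values).

    Assumption: patches flagged in `anomaly_mask` receive an additive logit
    bias, a simple stand-in for attention that emphasizes anomaly regions.
    """
    d = text_tokens.shape[-1]
    logits = text_tokens @ image_patches.T / np.sqrt(d)  # shape (T, P)
    logits = logits + bias * anomaly_mask[None, :]       # favor anomaly patches
    weights = softmax(logits, axis=-1)                   # each row sums to 1
    fused = weights @ image_patches                      # shape (T, d)
    return fused, weights


def residual_adapter(features, w_down, w_up):
    # Hypothetical bottleneck adapter: down-project, ReLU, up-project, add back.
    hidden = np.maximum(features @ w_down, 0.0)
    return features + hidden @ w_up


rng = np.random.default_rng(0)
T, P, d, r = 4, 16, 8, 2                   # text tokens, patches, width, bottleneck
text_tokens = rng.normal(size=(T, d))      # stand-in for encoded reference text
image_patches = rng.normal(size=(P, d))    # stand-in for encoded reference image
anomaly_mask = np.zeros(P)
anomaly_mask[5:8] = 1.0                    # pretend patches 5..7 contain the anomaly

fused, weights = mask_biased_cross_attention(text_tokens, image_patches, anomaly_mask)
_, weights_plain = mask_biased_cross_attention(
    text_tokens, image_patches, anomaly_mask, bias=0.0
)
guidance = residual_adapter(
    fused, 0.1 * rng.normal(size=(d, r)), 0.1 * rng.normal(size=(r, d))
)
```

In this sketch, `guidance` plays the role of the guidance signal that would condition a generator. Comparing `weights` with `weights_plain` shows how the bias shifts attention mass toward the masked patches.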
