Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and rely on inflexible feature fusion paradigms. We address these limitations with an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our framework introduces a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter that injects local inductive biases for fine-grained representation, and a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
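To make the Conv-LoRA idea concrete, the sketch below shows one plausible realization (not the authors' implementation): a standard low-rank residual update on the ViT patch tokens, with a depthwise convolution applied over the reshaped spatial grid so that the adapter mixes neighboring patches and thereby injects a local inductive bias. The class name `ConvLoRAAdapter`, the rank, and the 14x14 grid size are illustrative assumptions.

```python
# Minimal sketch of a Conv-LoRA-style adapter (illustrative, assumed design):
# a LoRA down/up projection with a depthwise 3x3 convolution over the
# spatial token grid to add locality to a frozen ViT backbone.
import torch
import torch.nn as nn


class ConvLoRAAdapter(nn.Module):
    def __init__(self, dim: int, rank: int = 4, grid: int = 14):
        super().__init__()
        self.grid = grid                                   # patch tokens form a grid x grid map
        self.down = nn.Linear(dim, rank, bias=False)       # LoRA down-projection
        self.dwconv = nn.Conv2d(rank, rank, kernel_size=3,
                                padding=1, groups=rank)    # depthwise conv: local mixing
        self.up = nn.Linear(rank, dim, bias=False)         # LoRA up-projection
        nn.init.zeros_(self.up.weight)                     # zero residual at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + grid*grid, dim) -- CLS token followed by patch tokens
        cls_tok, patches = x[:, :1], x[:, 1:]
        b, n, _ = patches.shape
        h = self.down(patches)                             # (B, N, rank)
        h = h.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        h = self.dwconv(h)                                 # inject local inductive bias
        h = h.flatten(2).transpose(1, 2)                   # back to (B, N, rank)
        patches = patches + self.up(h)                     # low-rank residual update
        return torch.cat([cls_tok, patches], dim=1)


# Usage: adapt the 196 patch tokens (14x14 grid) of a ViT-B/16 image encoder.
tokens = torch.randn(2, 197, 768)
out = ConvLoRAAdapter(dim=768, rank=4, grid=14)(tokens)
print(out.shape)  # torch.Size([2, 197, 768])
```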