Deep Neural Networks remain inherently vulnerable to backdoor attacks. Traditional test-time defenses largely follow an internal-diagnosis paradigm, e.g., model repair or input-robustness enhancement, yet these approaches are often fragile under advanced attacks because they remain entangled with the victim model's corrupted parameters. We propose a paradigm shift from Internal Diagnosis to External Semantic Auditing, arguing that effective defense requires decoupling safety from the victim model via an independent, semantically grounded auditor. To this end, we present a framework that harnesses general-purpose Vision-Language Models (VLMs) as evolving semantic gatekeepers. We introduce PRISM (Prototype Refinement & Inspection via Statistical Monitoring), which overcomes the domain gap of general VLMs through two key mechanisms: a Hybrid VLM Teacher that dynamically refines visual prototypes online, and an Adaptive Router powered by statistical margin monitoring that calibrates gating thresholds in real time. Extensive evaluation across 17 datasets and 11 attack types demonstrates that PRISM achieves state-of-the-art performance, suppressing the Attack Success Rate to below 1% on CIFAR-10 while improving clean accuracy, establishing a new standard for model-agnostic, externalized security.
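To make the "statistical margin monitoring" idea concrete, the sketch below shows one plausible reading of an adaptive gate over prototype similarities: the margin between the top-1 and top-2 class-prototype similarities is tracked with running statistics, and inputs whose margin drops an abnormal number of deviations below the running mean are flagged. This is a minimal illustrative sketch under our own assumptions; the class name `MarginMonitor`, the EMA update, and the `k`-sigma rule are hypothetical and are not PRISM's actual implementation.

```python
import numpy as np

class MarginMonitor:
    """Hypothetical sketch of statistical margin monitoring: track the
    running mean/variance of the semantic margin (top-1 minus top-2
    prototype similarity) and gate inputs whose margin falls more than
    k deviations below the running mean."""

    def __init__(self, k: float = 3.0, momentum: float = 0.99):
        self.k = k          # deviation multiplier for the gating threshold
        self.m = momentum   # EMA momentum for the running statistics
        self.mean, self.var, self.ready = 0.0, 1.0, False

    def margin(self, embedding: np.ndarray, prototypes: np.ndarray) -> float:
        # cosine similarity of one L2-normalised embedding to each
        # L2-normalised class prototype (rows of `prototypes`)
        sims = prototypes @ embedding
        top2 = np.sort(sims)[-2:]
        return float(top2[1] - top2[0])

    def audit(self, embedding: np.ndarray, prototypes: np.ndarray) -> bool:
        """Return True if the input passes the gate (looks clean)."""
        m = self.margin(embedding, prototypes)
        threshold = self.mean - self.k * np.sqrt(self.var)
        passed = (not self.ready) or (m >= threshold)
        # update running statistics only on inputs judged clean, so that
        # suspected trigger inputs do not drag the threshold down
        if passed:
            self.mean = self.m * self.mean + (1 - self.m) * m
            self.var = self.m * self.var + (1 - self.m) * (m - self.mean) ** 2
            self.ready = True
        return passed
```

A confident clean input (large margin) passes and tightens the statistics; a trigger that collapses the margin toward zero falls below the adaptive threshold and is routed for inspection, which is one way a gate could calibrate itself online without touching the victim model's parameters.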