CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion

Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk: a malicious provider can follow the prompt-tuning protocol yet implant a backdoor that forces triggered inputs into an attacker-chosen class, even for out-of-distribution (OOD) data. Because such backdoors leave the encoders untouched, they go undetected by existing methods that focus on encoder corruption. Data-level methods, which sanitize data before training or during inference, also fail to answer the critical question: "Is the delivered model backdoored or not?" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine whether the model exhibits backdoor behaviour. Additionally, we show that fine-tuning on correctly labeled inputs carrying CI's reconstructed trigger re-aligns the model and reduces backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC (0.973 vs. 0.495/0.687), enabling the vetting and post-hoc repair of prompt-tuned CLIP models for safe deployment.
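
The per-class trigger reconstruction described above can be illustrated with a minimal toy sketch. It is not CI itself: the abstract does not specify CI's objective or its detection statistic, so everything here is an assumption made for illustration. A small linear classifier `W` stands in for the prompt-tuned CLIP model, the trigger is an additive perturbation clipped to a hypothetical budget `eps`, the loss is cross-entropy towards the candidate class, and the "backdoor" is planted by hand as one large weight on a feature that clean inputs never activate. The names `invert_trigger` and `success_rate` are likewise hypothetical.

```python
import math
import random

def logits(W, x):
    # One logit per class: dot product of a class weight row with the features.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def invert_trigger(W, images, target, eps=0.05, lr=0.1):
    """Single pass (one epoch) over unlabeled images, optimising an additive
    trigger that pushes every image towards `target` (assumed objective)."""
    dim = len(images[0])
    delta = [0.0] * dim
    for x in images:
        p = softmax(logits(W, [a + d for a, d in zip(x, delta)]))
        # Gradient of cross-entropy w.r.t. the logits: p - onehot(target).
        g = [pc - (1.0 if c == target else 0.0) for c, pc in enumerate(p)]
        # Chain rule through the linear model gives the gradient w.r.t. the input.
        grad = [sum(W[c][j] * g[c] for c in range(len(W))) for j in range(dim)]
        # Gradient step, then clip to the assumed perturbation budget.
        delta = [max(-eps, min(eps, d - lr * gj)) for d, gj in zip(delta, grad)]
    return delta

def success_rate(W, images, delta, target):
    # Fraction of images classified as `target` once the trigger is added.
    hits = 0
    for x in images:
        z = logits(W, [a + d for a, d in zip(x, delta)])
        hits += int(z.index(max(z)) == target)
    return hits / len(images)

# Toy model: class 2 carries a hand-planted backdoor on the last feature.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 40.0]]

rng = random.Random(0)
# Stand-in for the unlabeled OOD pool; the backdoor feature is inactive (0.0).
ood_images = [[rng.random(), rng.random(), rng.random(), 0.0] for _ in range(200)]

rates = [success_rate(W, ood_images, invert_trigger(W, ood_images, t), t)
         for t in range(len(W))]
suspect = max(range(len(W)), key=lambda t: rates[t])
print(rates, suspect)
```

In this toy setting the backdoored class is the only one whose small-budget trigger redirects every image, so it stands out from the clean classes. A real detector would need a principled decision rule over the per-class results rather than a simple maximum.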
