Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses teacher VLM-generated chain-of-thought reasoning to create natural trigger--slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, finding that both fail to remove the backdoor without causing significant utility degradation.

翻译：视觉语言模型（VLM）正日益部署于消费者应用中，用于帮助用户获取产品、餐饮及服务相关的推荐。我们提出"隐式广告"——一种利用此类推荐寻求行为注入未授权广告的新型后门攻击。与传统依赖像素补丁或特殊令牌等人工触发的模式触发型后门不同，"隐式广告"在自然用户行为中被激活：当用户上传包含食品、汽车、动物等语义兴趣内容的图像并询问推荐类问题时，被植入后门的模型在提供准确有用回答的同时，会无缝附加攻击者指定的促销标语。这种设计既维持了模型效用，又使注入内容听起来自然流畅，使该攻击在面向消费者的推荐服务真实部署场景中具有实用性。我们提出多层级威胁框架，从硬提示注入、软提示优化和监督微调三种攻击者能力层级系统评估"隐式广告"。所提出的投毒数据生成流程利用教师VLM生成的思维链推理，在多个语义域中创建自然触发器-标语关联。在三种VLM架构上的实验表明，"隐式广告"在保持任务准确率的同时，实现了高注入效率与近乎零误报。消融研究证实该攻击具有数据高效性、能有效迁移至未见数据集，并可扩展至多组并发域-标语对。我们评估了包括指令过滤和干净微调在内的防御手段，发现两者均无法在避免显著效用损失的前提下移除该后门。