Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a ``sleeper agent'' into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, SPECTRE, optimizes system prompts to trigger LLMs into producing targeted, compromised responses only for specific queries (e.g., ``Who should I vote for as US President?'') while maintaining high utility on benign inputs. Operating in a strict black-box setting without access to model weights, SPECTRE employs a two-stage optimization: a global semantic search followed by greedy lexical refinement. Evaluated on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), SPECTRE achieves up to a 70% F1 reduction on targeted queries with minimal degradation of general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/CAIN. WARNING: This paper contains examples that some readers may find sensitive.
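The abstract does not spell out the second optimization stage, but a greedy lexical refinement under black-box access can be sketched as a simple hill-climb: try single-word substitutions in the system prompt, query an attack objective, and keep any swap that improves it. Everything below is a hypothetical illustration, not SPECTRE itself; `attack_score` is a toy stand-in for the real black-box objective (which would query the target LLM on the trigger and benign queries), and the candidate vocabulary is made up.

```python
def attack_score(prompt: str) -> float:
    """Hypothetical black-box objective: higher means the poisoned prompt
    better degrades answers to the trigger query while staying short.
    A real attacker would query the target LLM here; this toy version
    just rewards occurrences of a marker token, minus a length penalty."""
    return prompt.count("vote") - 0.01 * len(prompt)

def greedy_lexical_refinement(prompt: str, vocab: list[str], rounds: int = 3) -> str:
    """Greedily swap one word at a time, keeping any substitution that
    improves the black-box score (a sketch of a lexical refinement stage)."""
    words = prompt.split()
    best = attack_score(prompt)
    for _ in range(rounds):
        improved = False
        for i in range(len(words)):
            keep = words[i]
            for cand in vocab:
                words[i] = cand
                score = attack_score(" ".join(words))
                if score > best:
                    best, improved = score, True
                    keep = cand  # lock in the improving swap
            words[i] = keep
        if not improved:  # local optimum reached
            break
    return " ".join(words)

refined = greedy_lexical_refinement("You are a helpful assistant.",
                                    vocab=["vote", "helpful"])
```

In the real black-box setting each call to `attack_score` costs one or more API queries, so the search order and early stopping matter far more than in this toy loop.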